This is an archive of the discontinued LLVM Phabricator instance.

[OpenMP] Introduce target memory manager
ClosedPublic

Authored by tianshilei1992 on Jun 2 2020, 10:12 PM.

Details

Summary

Target memory manager is introduced in this patch which aims to manage target
memory such that they will not be freed immediately when they are not used
because the overhead of memory allocation and free is very large. For CUDA
device, cuMemFree even blocks the context switch on device which affects
concurrent kernel execution.

The memory manager can be taken as a memory pool. It divides the pool into
multiple buckets according to the size such that memory allocation/free
distributed to different buckets will not affect each other.

In this version, we use the exact-equality policy to find a free buffer. This
is an open question: will best-fit work better here? IMO, best-fit is not good
for target memory management because computation on GPU usually requires GBs of
data. Best-fit might lead to a serious waste. For example, there is a free
buffer of size 1960MB, and now we need a buffer of size 1200MB. If best-fit,
the free buffer will be returned, leading to a 760MB waste.

The allocation will happen when there is no free memory left, and the memory
free on device will take place in the following two cases:

  1. The program ends. Obviously. However, there is a little problem that plugin

library is destroyed before the memory manager is destroyed, leading to a fact
that the call to target plugin will not succeed.

  1. Device is out of memory when we request a new memory. The manager will walk

through all free buffers from the bucket with largest base size, pick up one
buffer, free it, and try to allocate immediately. If it succeeds, it will
return right away rather than freeing all buffers in free list.

Update:
A threshold (8KB by default) is set such that users could control what size of memory
will be managed by the manager. It can also be configured by an environment variable
LIBOMPTARGET_MEMORY_MANAGER_THRESHOLD.

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Updated according to comments

tianshilei1992 marked 6 inline comments as done.Aug 5 2020, 12:32 PM

Updated the calculation of NumBuckets

tianshilei1992 marked an inline comment as done.Aug 5 2020, 12:39 PM

Use const_iterator

Make mutex close to their protected variables

Removed the plugin interface

tianshilei1992 edited the summary of this revision. (Show Details)Aug 10 2020, 6:37 PM
jdoerfert accepted this revision.Aug 10 2020, 7:02 PM

Some comments and nits you should take under consideration.

I'm not 100% sold on the list design, that we look for the exact size, and that we traverse the list while we look.
However, this is improving a lot over the status quo and we can revisit this with more profiling information later.

LGTM

Nits:
I'd rename memory.h into MemoryManager.h if we don't expect anything else to go in there that is not "a memory manager" at the end of the day. Same with the cpp.
I'm not sure we need the memory namespace, or the impl namespace for that matter.

openmp/libomptarget/src/memory.cpp
25

The last sentence is now obsolete. I'd just state that there is an environment variable to set the threshold for which we will manage allocations. Please actually put the name of the variable here ;).

69

Can we rename flp2 into findPreviousPowerOfTwo or something similarly descriptive?

158
250

This pattern occurs at least twice, might be worth to put it in a helper method, e.g., allocateOrFreeAndAllocate for the lack of a better name ;)

254

This message should be more descriptive I guess. "Return nullptr" is not helpful. Maybe spell out that we failed to allocate the requested memory, the device might be OOM. I guess this is also a good spot for some debugger events eventually...

This revision is now accepted and ready to land.Aug 10 2020, 7:02 PM
jdoerfert added inline comments.Aug 10 2020, 7:02 PM
openmp/libomptarget/plugins/exports
22

leftover.

I'm still doubtful about this. Bump allocate + no-op free is fast unless the GPU runs out of memory before the arena can be dropped. The list and mutex construction is unusual for an allocator.

Could it be moved under the cuda subdirectory, until another plugin wishes to use it? That means the logic for detecting if it's in use and corresponding API disappear for now.

tianshilei1992 marked an inline comment as done.

Update based on comments

Fixed compilation error

I'm still doubtful about this. Bump allocate + no-op free is fast unless the GPU runs out of memory before the arena can be dropped. The list and mutex construction is unusual for an allocator.

Right. We are working on that and if it turns out to be always superior we can move to that model. So far, this model is superior to what we had, by a lot.

Could it be moved under the cuda subdirectory, until another plugin wishes to use it? That means the logic for detecting if it's in use and corresponding API disappear for now.

This is *not* CUDA specific at all, please do not move generic things into target sub-directories, that is counterproductive. If we have another plugin that want to opt-out/in, we can have hooks for that. As there is non we support right now, hooks are added on-demand later.

I'm still doubtful about this. Bump allocate + no-op free is fast unless the GPU runs out of memory before the arena can be dropped. The list and mutex construction is unusual for an allocator.

The memory manager is not an allocator. We do need the mutex for the thread safety. I can't figure out a better way not to use the "list", which is a std::multiset here for efficient look up based on the size. Bump allocator is in another patch.

Some comments and nits you should take under consideration.

I'm not 100% sold on the list design, that we look for the exact size, and that we traverse the list while we look.

The "list" is not a real list. It is a std::multiset here. So basically its look up complexity is O(logn) on average. If we don't have such a thing, what would be a better way to organize those free nodes with different sizes?

I'd rename memory.h into MemoryManager.h if we don't expect anything else to go in there that is not "a memory manager" at the end of the day. Same with the cpp.

Done.

I'm not sure we need the memory namespace, or the impl namespace for that matter.

I prefer to leave the namespace. Current implementation of libomptarget has really poor code style. This is a totally new file. I hope to make it right from it.

Deleted unnecessary changes

OK, cool. If we're open to changing the implementation later this is fine by me. An instance per host thread is likely to be better than all the internal locks. Couple of minor comments above.

There are use cases for allocating device memory within the plugin itself. I think including MemoryManager.h from within the plugin would work for that.

openmp/libomptarget/src/MemoryManager.h
27 ↗(On Diff #284877)

Can we drop the shared_ptr here? Better to have the MemoryManager move-only and use unique_ptr

37 ↗(On Diff #284877)

Deallocate taking a size usually allows a faster implementation, but that can be left until said faster implementation is proposed

OK, cool. If we're open to changing the implementation later this is fine by me.

Always!

An instance per host thread is likely to be better than all the internal locks.

That is one of the things we can profile and change, no objection if it turns out problematic.

JonChesterfield accepted this revision.Aug 12 2020, 8:47 AM

LGTM then. Calling into the plugin to do the bulk alloc/free is nice.

Using std::unique_ptr for the Pimpl

tianshilei1992 marked 2 inline comments as done.Aug 12 2020, 9:19 AM
tianshilei1992 added inline comments.
openmp/libomptarget/src/MemoryManager.h
37 ↗(On Diff #284877)

I agree. Currently the plugin interface does not have such argument so we don't need that. In the future we might add that.

tianshilei1992 marked an inline comment as done.Aug 12 2020, 9:19 AM

There are use cases for allocating device memory within the plugin itself. I think including MemoryManager.h from within the plugin would work for that.

Unluckily, it doesn't work because it has a DeviceTy object…We might have common things such that all plugins can share in the future.

Using std::multiset::find instead of std::find_if for better performance

Updated some comments

  1. Please mention LIBOMPTARGET_MEMORY_MANAGER_THRESHOLD, default value and unit in the patch summary.
  2. Is it possible to have a unit test testing the manager class behaviors?
  3. Can we offload to host and run address sanitizer or valgrind?

I'm not sure if I'm asking for too much here.

openmp/libomptarget/src/MemoryManager.cpp
324 ↗(On Diff #285107)

SizeThreshold is global while Threshold is local. The default values is also different. I'm lost in the logic here.

openmp/libomptarget/src/MemoryManager.h
26 ↗(On Diff #285107)

Why is the pointer needed?
What is the design logic behind MemoryManagerTy and MemoryManagerImplTy layers? Can we just have one?

openmp/libomptarget/src/device.h
150

Could you explain why shared_ptr is needed?

tianshilei1992 marked 3 inline comments as done.Aug 12 2020, 10:11 AM
tianshilei1992 added inline comments.
openmp/libomptarget/src/MemoryManager.cpp
324 ↗(On Diff #285107)

Yeah, you're lost. By default, Threshold is 0, which means we will not overwrite SizeThreshold.

openmp/libomptarget/src/MemoryManager.h
26 ↗(On Diff #285107)

Pimpl. Like my previous comments mentioned before, this header will be included by others, I don't want unnecessary headers/declarations/definitions to be included to pollute others.

openmp/libomptarget/src/device.h
150

Such that I don't need to include MemoryManager.h in the header, and it doesn't hurt anything.

tianshilei1992 marked 3 inline comments as done.Aug 12 2020, 10:11 AM
ye-luo requested changes to this revision.Aug 12 2020, 10:11 AM

Block the patch temporarily for my earlier questions.

This revision now requires changes to proceed.Aug 12 2020, 10:11 AM
  1. Please mention LIBOMPTARGET_MEMORY_MANAGER_THRESHOLD, default value and unit in the patch summary.

Sure. Will do.

  1. Is it possible to have a unit test testing the manager class behaviors?

I don't think so. We don't have the "unit test" framework you want. If you insist some tests, I could add a simple "feature" test here.

  1. Can we offload to host and run address sanitizer or valgrind?

What do you mean by offload to host? This memory manager will not be used by the device.

tianshilei1992 edited the summary of this revision. (Show Details)Aug 12 2020, 10:16 AM
ye-luo added inline comments.Aug 12 2020, 10:22 AM
openmp/libomptarget/src/MemoryManager.h
26 ↗(On Diff #285107)

That is the job of header and cpp files.

openmp/libomptarget/src/device.h
150

This is obviously a wrong way. Move the constructor and destructor to cpp.

JonChesterfield added a comment.EditedAug 12 2020, 10:22 AM

It definitely can and should be tested. Instantiate on a device that uses host malloc/free for the functions and stress test it under valgrind.

I've started writing tests out of tree for stuff like this, which is not ideal, but means the code shipped without the tests is likely to be correct

openmp/libomptarget/src/MemoryManager.h
26 ↗(On Diff #285107)

No. You can refer to https://en.cppreference.com/w/cpp/language/pimpl for more details.

openmp/libomptarget/src/device.h
150

Why is it a wrong way? Is there any drawback?

It definitely can and should be tested. Instantiate on a device that uses host malloc/free for the functions and stress test it under valgrind.

The "unit test" Ye mentions is not the one you said here. I agree to add a test like you said and I will. The "unit test" Ye wants is to test the class MemoryManagerTy directly, which is currently not feasible. We don't have a test framework to support that.

Improved performance by removing one map table operation

Added a new test

Replaced std::shared_ptr with std::unique_ptr in the class DeviceTy

ye-luo added inline comments.Aug 12 2020, 4:49 PM
openmp/libomptarget/src/MemoryManager.h
26 ↗(On Diff #285107)

Pimpl. Like my previous comments mentioned before, this header will be included by others, I don't want unnecessary headers/declarations/definitions to be included to pollute others.

Where else do you have in mind this header will be included? So far there is only device.cpp.

openmp/libomptarget/src/device.cpp
32

Why do you think it is OK here leaving the copy constructor always setting MemoryManager nullptr? This cause surprises. The same question applies to assign operator as well.

ye-luo added inline comments.Aug 12 2020, 4:54 PM
openmp/libomptarget/src/MemoryManager.cpp
149 ↗(On Diff #285188)

There can be race when you test List.empty().

273 ↗(On Diff #285188)

There can be race in PtrToNodeTable when you find()

ye-luo added inline comments.Aug 12 2020, 5:09 PM
openmp/libomptarget/src/MemoryManager.cpp
327 ↗(On Diff #285188)

make_unique is better.

324 ↗(On Diff #285107)

Q1. Why SizeThreshold is not per device?
Q2. I was asking for a way to opt-out this optimization. But you ignore LIBOMPTARGET_MEMORY_MANAGER_THRESHOLD=0

openmp/libomptarget/src/device.cpp
330

I think this is your real default. The default value of SizeThreshold always gets overwritten.

ye-luo added inline comments.Aug 12 2020, 5:31 PM
openmp/libomptarget/src/MemoryManager.cpp
107 ↗(On Diff #285188)

Another shared_ptr. See typedef std::set<HostDataToTargetTy, std::less<>> HostDataToTargetListTy; as an example. There doesn't seem to need a pointer wrapping NodeTy.

ye-luo added inline comments.Aug 12 2020, 6:23 PM
openmp/libomptarget/src/MemoryManager.cpp
324 ↗(On Diff #285107)

Remove Q2. Opt-out has been supported.

openmp/libomptarget/src/MemoryManager.h
30 ↗(On Diff #285188)

Second (Third?) place with a default. Remove or error out if size 0?

tianshilei1992 marked 12 inline comments as done.Aug 12 2020, 7:35 PM
tianshilei1992 added inline comments.
openmp/libomptarget/src/MemoryManager.cpp
324 ↗(On Diff #285107)

Then how could you specify the threshold via environment variable for each device? You don't even know how many devices you're gonna have during the compilation time.

107 ↗(On Diff #285188)

We might have same nodes both in the table and the free list. It is on purpose because the map relation never changes, which could save us an operation on the map.

149 ↗(On Diff #285188)

It is on purpose, and the "race" is not a problem. Think about it. Even we wrap it into a lock and now it is empty, there is still chance that when we move to the next list, one or more nodes are returned to this list. No difference.

273 ↗(On Diff #285188)

There is no race because same NodePtr will never go to two threads.

327 ↗(On Diff #285188)

Sure.

openmp/libomptarget/src/MemoryManager.h
26 ↗(On Diff #285107)

That attributes to the Pimpl idiom. It is not a good practice to have too much implementation stuffs in the header file.

30 ↗(On Diff #285188)

Could remove it.

openmp/libomptarget/src/device.cpp
32

MemoryManager will be initialized separately later. The only reason we need this is std::vector<DeviceTy> requires it. We don't copy or construct those objects afterwards.

330

Yes. The logic is a little weird. I'll refactor this part.

tianshilei1992 marked 9 inline comments as done.Aug 12 2020, 7:35 PM
openmp/libomptarget/src/MemoryManager.cpp
149 ↗(On Diff #285188)

That seems to assume list.empty() is an atomic operation. It isn't - calling list.empty() from one thread while another can be inserting into the list is a data race.

We could do something involving a relaxed read followed by a lock followed by another read, in the double checked locking fashion. Uncontended locks are cheap though so it's probably not worthwhile.

193 ↗(On Diff #285188)

This seems bad. Perhaps we should call a function to do this work shortly before destroying the target plugin?

273 ↗(On Diff #285188)

It looks like PtrToNodeTable can be modified by other threads while this is running. Doesn't matter that NodePtr itself is unique - can't call .find() on the structure while another thread is mutating it.

openmp/libomptarget/src/device.cpp
32

std::vector<DeviceTy> should be content with a move constructor. Then the copy constructor can be = delete.

tianshilei1992 marked 4 inline comments as done.Aug 13 2020, 1:32 PM
tianshilei1992 added inline comments.
openmp/libomptarget/src/MemoryManager.cpp
149 ↗(On Diff #285188)

Double check does not work either. If the empty function might crash because of the data race, that is a problem. Otherwise, it is not a problem. Like I said, another thread could still insert node into the list after we check empty using a lock.

193 ↗(On Diff #285188)

That is a potential problem, and actually it might not be a problem. Only when we're going to exit the process can this function be invoked. Even the deallocation will not succeed, GPU memory will be free once the process exits anyway.

273 ↗(On Diff #285188)

The iterators will not be invalidated on multiset in insert operation, but anyway, I'm not sure whether it will crash in some middle status, so I'll wrap them into the guard lock.

openmp/libomptarget/src/device.cpp
32

std::mutex cannot be moved. That is the only reason we have the copy constructor.

tianshilei1992 marked 4 inline comments as done.Aug 13 2020, 1:32 PM

Updated based on comments

openmp/libomptarget/src/MemoryManager.cpp
149 ↗(On Diff #285188)

Double check requires an atomic read, though relaxed is fine. Or probably some use of barriers. Calling empty() while another thread modifies the list is the race. Because empty() is not atomic qualified, the race is UB.

Empty probably resolves to loading two pointers and comparing them for equality, so I sympathise with the argument that the race is benign, but it's still prudent to remove the data races we know about. Less UB, and means data race detectors will have a better chance of helping find bugs.

273 ↗(On Diff #285188)

A crash would be fine as we'd notice that. It's data corruption due to the race which is the hazard. Thanks for adding it to a locked region.

openmp/libomptarget/src/device.cpp
32

std::mutex can't be copied either. If a new default-initialised mutex is OK as the result of the copy, it would be OK as the result of a move too.

Removed the pimpl and namespace.

tianshilei1992 marked 3 inline comments as done.Aug 18 2020, 4:45 PM
tianshilei1992 added inline comments.
openmp/libomptarget/src/device.cpp
32

That is beyond the scope of this patch. Since it already has a user-defined copy operator, then let it be.

tianshilei1992 marked an inline comment as done.Aug 18 2020, 4:45 PM

Fixed comment of constructor

Harbormaster completed remote builds in B68819: Diff 286430.
ye-luo added a comment.EditedAug 18 2020, 6:57 PM

In addition,

  1. the DeviceTy copy constructor and assign operator are imperfect before this patch. I don't think we can fix them in this patch. We should just document the imperfection here.
  2. Because the memory limit is per allocation, it seems that the MemoryManager can still hold infinite amount of memory and we don't have way to free them. I'm concerned about having this feature on by default.
openmp/libomptarget/src/MemoryManager.cpp
324 ↗(On Diff #285107)

Then how could you specify the threshold via environment variable for each device? You don't even know how many devices you're gonna have during the compilation time.

Although your current implementation via environment variable cannot specify the size for each device, we may use configuration file in the future to control this. It will be helpful If you can facilitate this when cleaning up the logic for the default value.

130 ↗(On Diff #286432)

N->Ptr is deleted here. Then the shared_ptr in FreeLists[I] is deleted here but PtrToNodeTable still has the shared_ptr and an address which is no more valid.
If I understand correctly, you want FreeLists holds a subset of PtrToNodeTable memory segments.
I think what you need is

using FreeListTy = std::multiset<std::reference_wrapper<NodeTy>, NodeCmpTy>;
std::unordered_map<void *, NodeTy> PtrToNodeTable;

In this way, PtrToNodeTable is the unique owner of all the memory segments. FreeList only owns a reference.

tianshilei1992 marked an inline comment as done.Aug 18 2020, 7:42 PM
tianshilei1992 added inline comments.
openmp/libomptarget/src/MemoryManager.cpp
324 ↗(On Diff #285107)

I would prefer to keep current one. If in the future we have request for the per-device threshold, we could change it by the time. This can keep the implementation consistent.

130 ↗(On Diff #286432)

This is a nice catch. Thanks for that. Holding a reference will not solve the problem that the node should also be removed from the map. It is same as holding a shared_ptr, but I could in fact use the reference way for the FreeListTy.
The initial implementation is to remove the node from the map table and then add it to the free lists. Later I want to avoid the unnecessary operation on the map table but forget to update here. I’ll fix it.

tianshilei1992 marked an inline comment as done.Aug 18 2020, 7:56 PM
  1. Because the memory limit is per allocation, it seems that the MemoryManager can still hold infinite amount of memory and we don't have way to free them. I'm concerned about having this feature on by default.

First, users can always opt out the feature. What’s more important, if we receive complaints that this feature causes their applications OOM, we could evaluate it and then make corresponding change. What we know for now is many applications benefit from it.

ye-luo added inline comments.Aug 18 2020, 8:01 PM
openmp/libomptarget/src/MemoryManager.cpp
130 ↗(On Diff #286432)

The correct code needs to take care of both PtrToNodeTable and FreeLists regardless.

Currently in the destructor, you first deal with PtrToNodeTable and then FreeLists with some nullptr check.
If you switch to reference in FreeLists, only PtrToNodeTable needs to be taken care.

I still hope you find shared_ptr not needed at all.

tianshilei1992 added inline comments.Aug 18 2020, 8:17 PM
openmp/libomptarget/src/MemoryManager.cpp
130 ↗(On Diff #286432)

One benefit to use pointer is that we could use nullptr to tell a state, which is very important to narrow the critical area as much as possible. Reference does not have that quality so that I need to do more things in the critical area which is counter-efficient. I can take the map table as a container of nodes and use the raw pointer in the free lists.

ye-luo added inline comments.Aug 18 2020, 8:27 PM
openmp/libomptarget/src/MemoryManager.cpp
130 ↗(On Diff #286432)

Please don’t use raw pointers. If you look at reference_wrapper it has the same cost as taking the address and store the address. C++ guru invented that for us in a safe way.

ye-luo added inline comments.Aug 18 2020, 8:47 PM
openmp/libomptarget/src/MemoryManager.cpp
219 ↗(On Diff #286432)

When arrive here, the code should know if the memory is from free list or newly allocated. It doesn’t even need to do the find. It is wasting time. We may just use std::list if we don’t need to find.

Removed all shared_ptr stuffs and fixed one potential issue

tianshilei1992 marked 4 inline comments as done.Aug 19 2020, 11:37 AM
ye-luo added inline comments.Aug 19 2020, 12:34 PM
openmp/libomptarget/src/MemoryManager.cpp
214 ↗(On Diff #286621)

Use emplace and its return value iterator to avoid the later lookup(at).

234 ↗(On Diff #286621)

I don't what the policy of using auto. auto makes the code cleaner.
There are a few similar places with iterators.

openmp/libomptarget/src/device.cpp
384

Prefer

else
  return RTL->data_delete(RTLDeviceID, TgtPtrBegin);

the same change to RTL->data_alloc above

Updated based on review comments

tianshilei1992 marked 3 inline comments as done.Aug 19 2020, 1:05 PM
tianshilei1992 added inline comments.
openmp/libomptarget/src/device.cpp
384

It's a code style preference. I would go with "no else after return".

tianshilei1992 marked an inline comment as done.Aug 19 2020, 1:05 PM
ye-luo accepted this revision.Aug 19 2020, 1:17 PM

LGTM

This revision is now accepted and ready to land.Aug 19 2020, 1:17 PM

Fixed the build issue when OMPTARGET_DEBUG is not defined

Fixed the clang-tidy warning llvm-header-guard

Change the header guard to make clang-tidy happy

This revision was automatically updated to reflect the committed changes.

Since I spent hours to hunt down several race conditions in libomp in the last months, please fix races immediately, when they are pointed out. There is no such thing as a benign race!

openmp/libomptarget/src/MemoryManager.cpp
149 ↗(On Diff #285188)

Double check does not work either. If the empty function might crash because of the data race, that is a problem. Otherwise, it is not a problem. Like I said, another thread could still insert node into the list after we check empty using a lock.

double check is used to solve race condition not data race. data race is UB and must be avoided. race condition is not UB and might be accepted (benign), but can also break the code - especially reference counting.
To avoid the data race, as Jon said, you should use atomics. You might want to add an atomic counter to avoid the use of non-atomic List.empty().

When using double-checking, you need to perform all changes under lock (inserting to the list must be done under the same lock). All related double-checks occur under the same lock. In this case, the issue you tried to make can not occur.

As a heads up, I'm told this breaks amdgpu tests. @ronlieb is looking at the merge from upstream, don't have any more details at this time. The basic idea of wrapping device alloc seems likely to be sound for all targets so I'd guess we've run into a bug in this patch.

As a heads up, I'm told this breaks amdgpu tests. @ronlieb is looking at the merge from upstream, don't have any more details at this time. The basic idea of wrapping device alloc seems likely to be sound for all targets so I'd guess we've run into a bug in this patch.

If it is a thread-safety issue, adding mutex in out facing allocate and free should make the code safe while investigating the root cause.

As a heads up, I'm told this breaks amdgpu tests. @ronlieb is looking at the merge from upstream, don't have any more details at this time. The basic idea of wrapping device alloc seems likely to be sound for all targets so I'd guess we've run into a bug in this patch.

Yeah, issuing a bug would be nice because at least I could get a reproducer. ;-) BTW, all data race mentioned by others were guarded by lock actually.

openmp/libomptarget/src/MemoryManager.cpp
88 ↗(On Diff #286705)

This "little issue" of calling into the target plugin after it has been destroyed is a contender for this patch not working on amdgpu.

I still think the target plugin, if it wishes to use this allocator, should hold the state itself. That means the allocator can be used internally, e.g. for call frames or the parallel region malloc, as well making destruction order straightforward and correct.

protze.joachim added a comment.EditedNov 2 2020, 2:32 PM

The test asserts for x86 offloading:

memory_manager.cpp.tmp-x86_64-pc-linux-gnu: llvm-project/openmp/libomptarget/test/offloading/memory_manager.cpp:37: int main(int, char **): Assertion `buffer[j] == i' failed.
memory_manager.cpp.tmp-x86_64-pc-linux-gnu: llvm-project/openmp/libomptarget/test/offloading/memory_manager.cpp:37: int main(int, char **): Assertion `buffer[j] == i' failed.

The test asserts for x86 offloading:

memory_manager.cpp.tmp-x86_64-pc-linux-gnu: llvm-project/openmp/libomptarget/test/offloading/memory_manager.cpp:37: int main(int, char **): Assertion `buffer[j] == i' failed.
memory_manager.cpp.tmp-x86_64-pc-linux-gnu: llvm-project/openmp/libomptarget/test/offloading/memory_manager.cpp:37: int main(int, char **): Assertion `buffer[j] == i' failed.

Cannot reproduce the failure on my side

I tested this with older clang releases (at least back to clang 9.0) and could reproduce the assertion. The error doesn't seem to be related to this patch, but the test just reveals the issue.

I could reduce the issue to:

#include <omp.h>
#include <cassert>
#include <iostream>
#define N 10

int main(int argc, char *argv[]) {
#pragma omp parallel for num_threads(4)
  for (int i = 0; i < 16; ++i) {
    int buffer[N];
    printf("i=%i, n=%i, buffer=%p\n",i,N,buffer);
#pragma omp critical
#pragma omp target teams distribute parallel for              \
    map(from                                                  \
        : buffer)
    for (int j = 0; j < N; ++j) {
      buffer[j] = i;
    }
    for (int j = 0; j < N; ++j) {
      if(buffer[j] != i){
        printf("buffer[j=%i]=%i != i=%i, buffer=%p\n",j,buffer[j],i,buffer);
        assert(buffer[j] == i);
      }
    }
  }
  std::cout << "PASS\n";
  return 0;
}

So I think, that the map(from) fails when executed from multiple threads. The issue goes away, if the initial test is executed with OMP_NUM_THREADS=1. Adding the critical does not solve the issue. So, I don't think that a race in libomptarget is causing the issue.

tcramer added a subscriber: tcramer.Dec 9 2020, 9:49 AM
protze.joachim added inline comments.Dec 10 2020, 8:44 AM
openmp/libomptarget/src/MemoryManager.cpp
88 ↗(On Diff #286705)

@tianshilei1992 Any plan to fix this?
This does not only break for AMD, but also for a plugin our group is working on.

Without understanding all the details, I think, the destructor of DeviceTy should delete the MemoryManager?
Would this solve the issue? I.e. is the DeviceTy destroyed before the target plugin is unloaded?

protze.joachim added inline comments.Dec 10 2020, 8:53 AM
openmp/libomptarget/src/MemoryManager.cpp
88 ↗(On Diff #286705)

Nevermind, the unique_ptr should take care of the release. So, why is the device not destroyed before the plugin is unloaded?

JonChesterfield added a comment.EditedDec 10 2020, 10:01 AM

I think I volunteered to fix the global constructor/destructor hazard, then forgot about it.

My intent is to add functions to the plugin:

some_enum __tgt_rtl_plugin_init(void);
some_enum __tgt_rtl_plugin_dtor(void);

with the invariant that plugin_init is the first function called on a given plugin, and plugin_dtor is the last function called. Probably also that init, dtor are called at most once, and the dtor is called exactly once if init is called.

The initialization that currently occurs for global variables in the plugin can then optionally be done in the init call. Libomptarget shall destroy the memory manager before calling dtor, so that it can make calls into the plugin during the destruction.

This doesn't address multiple instances of a given plugin, but also doesn't preclude it. Any plugin that doesn't implement these, won't have them called.

edit: However, I don't think libomptarget knows when a given plugin is no longer in use. There's a TODO in rtl.cpp about removing a RTL if it's not used any more, but I can't see how that can be derived reliably from calls into interface.cpp.

edit2: If we move LoadRTLs out of the first call to RegisterLib and into init() or the PluginManager constructor, then we can move some unloading logic out of UnregisterLib and call that from deinit(), at which point we'll have a good place to put the teardown,

I'm surprised to find no dlclose matching the dlopen. Instead of calling some function for init/destroy, can't we just use library constructor/destructors in the plugin? All MemoryManagers for a plugin should then be destroyed before the plugin is explicitly dlclosed.
I'm also surprised that LoadRTLs does not dlclose the library in case of missing symbols.

manorom added inline comments.
openmp/libomptarget/src/MemoryManager.cpp
88 ↗(On Diff #286705)

Nevermind, the unique_ptr should take care of the release. So, why is the device not destroyed before the plugin is unloaded?

Hope I'm not too late to the party, but:

  • If I tracked this down correctly, plugins don't really get unloaded explicitly but only when the host program terminates and the program and its libraries get unloaded by OS.
  • Plugins keep ther state in global objects so their destructor is called when the plugin library is unloaded (at least thats when the VE plugin cleans up its resources, including its target memory).
  • The MemoryManager is (ultimately) owned by the PluginManger which gets constructed explicitly by __attribute__((constructor)) and __attribute__((destructor)) functions in rtl.cpp

So what I guess happens is, that the host program terminates, and then all global destructors are exeuted including those in libomptarget and the plugin libraries (before any library actually unloads). And the destructor which is called first happens to be the destrutor for the plugin library and the destructor function which deletes the PluginManger gets called later.

This should be disabled on non-cuda platforms. It is presently a performance improvement on cuda, might improve or regress performance on others, and has a call method on dead object bug that has been open for months.

In particular I don't think it helps performance on amdgpu and it's annoying to set an environment variable to suppress a known bug.

I’m going to put the issue on the top of my list.

JonChesterfield added a comment.EditedJan 7 2021, 6:56 AM

I’m going to put the issue on the top of my list.

Nice! Thank you.

I was thinking of adding an optional function to the plugin api, bool (*enable_memory_manager)(void) or similar, which defaults to return false; if not implemented. It seems the amd internal branch currently has an #if 0 around the entry point to avoid checking an environment variable, but I'd really like to get rid of that local patch.

The fix is on Phab now. Please refer to D94256 for more details.