
[OpenMP] Introduce target memory manager
ClosedPublic

Authored by tianshilei1992 on Jun 2 2020, 10:12 PM.

Details

Summary

This patch introduces a target memory manager that aims to manage target
memory so that buffers are not freed immediately when they are no longer used,
because the overhead of memory allocation and deallocation is very large. On
CUDA devices, cuMemFree even blocks context switches on the device, which
hurts concurrent kernel execution.

The memory manager can be thought of as a memory pool. It divides the pool into
multiple buckets according to allocation size, so that allocations and frees
directed to different buckets do not affect each other.

In this version, we use an exact-equality policy to find a free buffer. This
leaves an open question: would best-fit work better here? IMO, best-fit is not
a good choice for target memory management because computation on GPUs usually
requires GBs of data, so best-fit could lead to serious waste. For example,
suppose there is a free buffer of size 1960MB and we now need a buffer of size
1200MB. With best-fit, the 1960MB buffer would be returned, wasting 760MB.

The allocation will happen when there is no free memory left, and the memory
free on device will take place in the following two cases:

  1. The program ends. Obviously. However, there is a small problem: the plugin

library is destroyed before the memory manager is destroyed, so the calls into
the target plugin at that point will not succeed.

  2. The device is out of memory when we request new memory. The manager will

walk through all free buffers, starting from the bucket with the largest base
size, pick one buffer, free it, and immediately retry the allocation. If the
retry succeeds, it returns right away rather than freeing all buffers in the
free lists.
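The out-of-memory path in case 2 can be sketched as follows. This is an illustrative sketch, not the patch's code; `allocateWithEviction`, `DeviceAlloc`, and `DeviceFree` are hypothetical names standing in for the manager's internals and the plugin's allocation entry points.

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <vector>

// Illustrative sketch of the out-of-memory path: walk the free buckets from
// the largest base size downwards, free one cached buffer at a time, and
// retry the allocation after each free. Return as soon as a retry succeeds
// instead of draining every free list.
void *allocateWithEviction(
    std::map<size_t, std::vector<void *>> &Buckets, size_t Size,
    const std::function<void *(size_t)> &DeviceAlloc,
    const std::function<void(void *)> &DeviceFree) {
  if (void *Ptr = DeviceAlloc(Size))
    return Ptr;
  // Largest buckets first: freeing big buffers frees the most memory per call.
  for (auto It = Buckets.rbegin(); It != Buckets.rend(); ++It) {
    while (!It->second.empty()) {
      DeviceFree(It->second.back());
      It->second.pop_back();
      if (void *Ptr = DeviceAlloc(Size))
        return Ptr; // stop early; no need to drain all free lists
    }
  }
  return nullptr; // device genuinely out of memory
}
```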

Update:
A threshold (8KB by default) is set so that users can control what size of
memory is managed by the manager. It can also be configured via the environment
variable LIBOMPTARGET_MEMORY_MANAGER_THRESHOLD.
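One way the threshold lookup could look is sketched below. The helper name `getManagerThreshold` is illustrative; only the environment variable name and the 8KB default come from the patch description.

```cpp
#include <cstddef>
#include <cstdlib>

// Sketch of reading the manager threshold: 8KB by default, overridable via
// the LIBOMPTARGET_MEMORY_MANAGER_THRESHOLD environment variable. Invalid or
// missing values fall back to the default.
size_t getManagerThreshold() {
  constexpr size_t DefaultThreshold = 1u << 13; // 8KB
  if (const char *Env = std::getenv("LIBOMPTARGET_MEMORY_MANAGER_THRESHOLD")) {
    char *End = nullptr;
    unsigned long long Val = std::strtoull(Env, &End, 10);
    if (End != Env && *End == '\0')
      return static_cast<size_t>(Val);
  }
  return DefaultThreshold;
}
```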

Diff Detail

Event Timeline

tianshilei1992 marked 12 inline comments as done.Aug 12 2020, 7:35 PM
tianshilei1992 added inline comments.
openmp/libomptarget/src/MemoryManager.cpp
108

We might have the same nodes in both the table and the free list. That is on purpose: the mapping never changes, which saves us an operation on the map.

150

It is on purpose, and the "race" is not a problem. Think about it: even if we wrap it in a lock and the list is empty at that point, there is still a chance that, by the time we move to the next list, one or more nodes have been returned to this list. No difference.

274

There is no race because the same NodePtr will never go to two threads.

325

Then how could you specify the threshold via an environment variable for each device? You don't even know how many devices there will be at compile time.

328

Sure.

openmp/libomptarget/src/MemoryManager.h
27

That is the Pimpl idiom. It is not good practice to have too much implementation detail in the header file.

31

Could remove it.

openmp/libomptarget/src/device.cpp
31

MemoryManager will be initialized separately later. The only reason we need this is that std::vector<DeviceTy> requires it. We don't copy or construct those objects afterwards.

369

Yes. The logic is a little weird. I'll refactor this part.

tianshilei1992 marked 9 inline comments as done.Aug 12 2020, 7:35 PM
openmp/libomptarget/src/MemoryManager.cpp
150

That seems to assume list.empty() is an atomic operation. It isn't - calling list.empty() from one thread while another can be inserting into the list is a data race.

We could do something involving a relaxed read followed by a lock followed by another read, in the double checked locking fashion. Uncontended locks are cheap though so it's probably not worthwhile.

194

This seems bad. Perhaps we should call a function to do this work shortly before destroying the target plugin?

274

It looks like PtrToNodeTable can be modified by other threads while this is running. Doesn't matter that NodePtr itself is unique - can't call .find() on the structure while another thread is mutating it.

openmp/libomptarget/src/device.cpp
31

std::vector<DeviceTy> should be content with a move constructor. Then the copy constructor can be = delete.

tianshilei1992 marked 4 inline comments as done.Aug 13 2020, 1:32 PM
tianshilei1992 added inline comments.
openmp/libomptarget/src/MemoryManager.cpp
150

Double-checking does not work either. If the empty function might crash because of the data race, that is a problem; otherwise it is not. Like I said, another thread could still insert a node into the list right after we check emptiness under a lock.

194

That is a potential problem, but in practice it might not matter. This function can only be invoked when we're about to exit the process. Even if the deallocation does not succeed, the GPU memory will be freed once the process exits anyway.

274

Insert operations do not invalidate iterators on a multiset, but anyway, I'm not sure whether it could crash in some intermediate state, so I'll wrap these in the guard lock.

openmp/libomptarget/src/device.cpp
31

std::mutex cannot be moved. That is the only reason we have the copy constructor.

tianshilei1992 marked 4 inline comments as done.Aug 13 2020, 1:32 PM

Updated based on comments

openmp/libomptarget/src/MemoryManager.cpp
150

Double check requires an atomic read, though relaxed is fine. Or probably some use of barriers. Calling empty() while another thread modifies the list is the race. Because empty() is not atomic qualified, the race is UB.

Empty probably resolves to loading two pointers and comparing them for equality, so I sympathise with the argument that the race is benign, but it's still prudent to remove the data races we know about. Less UB, and means data race detectors will have a better chance of helping find bugs.

274

A crash would be fine as we'd notice that. It's data corruption due to the race which is the hazard. Thanks for adding it to a locked region.

openmp/libomptarget/src/device.cpp
31

std::mutex can't be copied either. If a new default-initialised mutex is OK as the result of the copy, it would be OK as the result of a move too.
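The constructor discussion above can be made concrete. The sketch below is a simplified stand-in for the real DeviceTy (only the mutex member matters here): since std::mutex is neither copyable nor movable, a user-defined constructor is needed for std::vector, and as the reviewer suggests, a move constructor that default-initialises the mutex satisfies std::vector while letting the copy constructor be deleted.

```cpp
#include <mutex>
#include <vector>

// Simplified stand-in for libomptarget's DeviceTy, illustrating the review
// point: std::mutex cannot be copied or moved, so the compiler-generated
// copy/move operations are deleted. A user-defined move constructor that
// leaves the mutex freshly default-constructed is enough for
// std::vector<DeviceTy>, and the copy constructor can then be `= delete`.
struct DeviceTy {
  int DeviceID = -1;
  std::mutex DataMapMtx; // neither movable nor copyable

  explicit DeviceTy(int ID) : DeviceID(ID) {}
  // Move the movable members; the mutex is default-constructed anew, which is
  // fine as long as nobody holds it while the vector grows.
  DeviceTy(DeviceTy &&Other) noexcept : DeviceID(Other.DeviceID) {}
  DeviceTy(const DeviceTy &) = delete;
  DeviceTy &operator=(const DeviceTy &) = delete;
};
```

The `noexcept` on the move constructor matters: it lets std::vector move (rather than copy) elements during reallocation.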

Removed the pimpl and namespace.

tianshilei1992 marked 3 inline comments as done.Aug 18 2020, 4:45 PM
tianshilei1992 added inline comments.
openmp/libomptarget/src/device.cpp
31

That is beyond the scope of this patch. Since it already has a user-defined copy constructor, let it be.

tianshilei1992 marked an inline comment as done.Aug 18 2020, 4:45 PM

Fixed comment of constructor

Harbormaster completed remote builds in B68819: Diff 286430.
ye-luo added a comment.EditedAug 18 2020, 6:57 PM

In addition,

  1. the DeviceTy copy constructor and assignment operator were imperfect before this patch. I don't think we can fix them in this patch; we should just document the imperfection here.
  2. Because the memory limit is per allocation, it seems that the MemoryManager can still hold an unbounded amount of memory, and we have no way to free it. I'm concerned about having this feature on by default.
openmp/libomptarget/src/MemoryManager.cpp
131

N->Ptr is deleted here. Then the shared_ptr in FreeLists[I] is deleted here, but PtrToNodeTable still has the shared_ptr and an address that is no longer valid.
If I understand correctly, you want FreeLists to hold a subset of the memory segments in PtrToNodeTable.
I think what you need is

using FreeListTy = std::multiset<std::reference_wrapper<NodeTy>, NodeCmpTy>;
std::unordered_map<void *, NodeTy> PtrToNodeTable;

In this way, PtrToNodeTable is the unique owner of all the memory segments; the free lists only hold references.
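The suggested ownership scheme can be sketched as follows. NodeTy and NodeCmpTy are simplified relative to the actual patch; only the container types come from the comment above.

```cpp
#include <cstddef>
#include <functional>
#include <set>
#include <unordered_map>

// Sketch of the suggested ownership scheme: PtrToNodeTable owns every node,
// and the free lists hold only reference_wrapper aliases into it, so there is
// never a second owner left with a dangling pointer.
struct NodeTy {
  void *Ptr;
  size_t Size;
};

// Order free-list entries by buffer size; reference_wrapper<NodeTy> converts
// implicitly to NodeTy& when the comparator is invoked.
struct NodeCmpTy {
  bool operator()(const NodeTy &A, const NodeTy &B) const {
    return A.Size < B.Size;
  }
};

using FreeListTy = std::multiset<std::reference_wrapper<NodeTy>, NodeCmpTy>;
```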

325

Then how could you specify the threshold via an environment variable for each device? You don't even know how many devices there will be at compile time.

Although your current implementation via environment variable cannot specify the size per device, we may use a configuration file in the future to control this. It would be helpful if you could facilitate this when cleaning up the logic for the default value.

tianshilei1992 marked an inline comment as done.Aug 18 2020, 7:42 PM
tianshilei1992 added inline comments.
openmp/libomptarget/src/MemoryManager.cpp
131

This is a nice catch, thanks. Holding a reference will not by itself solve the problem that the node should also be removed from the map; it is the same as holding a shared_ptr in that respect. But I could in fact use references for FreeListTy.
The initial implementation removed the node from the map table and then added it to the free lists. Later I wanted to avoid the unnecessary operation on the map table but forgot to update this part. I'll fix it.

325

I would prefer to keep the current one. If in the future we get requests for a per-device threshold, we can change it then. This keeps the implementation consistent.

tianshilei1992 marked an inline comment as done.Aug 18 2020, 7:56 PM
  2. Because the memory limit is per allocation, it seems that the MemoryManager can still hold an unbounded amount of memory, and we have no way to free it. I'm concerned about having this feature on by default.

First, users can always opt out of the feature. More importantly, if we receive complaints that this feature causes applications to OOM, we can evaluate it and make the corresponding change. What we know for now is that many applications benefit from it.

ye-luo added inline comments.Aug 18 2020, 8:01 PM
openmp/libomptarget/src/MemoryManager.cpp
131

The correct code needs to take care of both PtrToNodeTable and FreeLists regardless.

Currently, in the destructor, you first deal with PtrToNodeTable and then with FreeLists, with some nullptr checks.
If you switch to references in FreeLists, only PtrToNodeTable needs to be taken care of.

I still hope you find that shared_ptr is not needed at all.

tianshilei1992 added inline comments.Aug 18 2020, 8:17 PM
openmp/libomptarget/src/MemoryManager.cpp
131

One benefit of using pointers is that we can use nullptr to signal a state, which is very important for narrowing the critical section as much as possible. References do not have that property, so I would need to do more work inside the critical section, which hurts efficiency. I can treat the map table as a container of nodes and use raw pointers in the free lists.

ye-luo added inline comments.Aug 18 2020, 8:27 PM
openmp/libomptarget/src/MemoryManager.cpp
131

Please don't use raw pointers. If you look at reference_wrapper, it has the same cost as taking the address and storing it. The C++ gurus invented it for us as a safe way to do this.

ye-luo added inline comments.Aug 18 2020, 8:47 PM
openmp/libomptarget/src/MemoryManager.cpp
220

When we arrive here, the code already knows whether the memory came from a free list or was newly allocated, so it doesn't even need to do the find; that is wasted time. We might as well use std::list if we don't need to find.

Removed all shared_ptr stuffs and fixed one potential issue

tianshilei1992 marked 4 inline comments as done.Aug 19 2020, 11:37 AM
ye-luo added inline comments.Aug 19 2020, 12:34 PM
openmp/libomptarget/src/MemoryManager.cpp
215

Use emplace and the iterator it returns to avoid the later lookup (at).
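The suggestion can be illustrated as below. `insertNode` and the simplified NodeTy are hypothetical; only the emplace-returns-an-iterator idea comes from the comment.

```cpp
#include <cstddef>
#include <unordered_map>

struct NodeTy {
  void *Ptr;
  size_t Size;
};

// Instead of `Table.emplace(Ptr, ...)` followed by a second hash lookup via
// `Table.at(Ptr)`, reuse the iterator that emplace already returns.
NodeTy &insertNode(std::unordered_map<void *, NodeTy> &Table, void *Ptr,
                   size_t Size) {
  auto Res = Table.emplace(Ptr, NodeTy{Ptr, Size});
  return Res.first->second; // no second lookup needed
}
```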

235

I don't know what the policy on using auto is, but auto makes the code cleaner.
There are a few similar places with iterators.

openmp/libomptarget/src/device.cpp
419

Prefer

else
  return RTL->data_delete(RTLDeviceID, TgtPtrBegin);

the same change to RTL->data_alloc above

Updated based on review comments

tianshilei1992 marked 3 inline comments as done.Aug 19 2020, 1:05 PM
tianshilei1992 added inline comments.
openmp/libomptarget/src/device.cpp
419

It's a code style preference. I would go with "no else after return".

tianshilei1992 marked an inline comment as done.Aug 19 2020, 1:05 PM
ye-luo accepted this revision.Aug 19 2020, 1:17 PM

LGTM

This revision is now accepted and ready to land.Aug 19 2020, 1:17 PM

Fixed the build issue when OMPTARGET_DEBUG is not defined

Fixed the clang-tidy warning llvm-header-guard

Change the header guard to make clang-tidy happy

This revision was automatically updated to reflect the committed changes.

Since I have spent hours hunting down several race conditions in libomp over the last months, please fix races immediately when they are pointed out. There is no such thing as a benign race!

openmp/libomptarget/src/MemoryManager.cpp
150

Double-checking does not work either. If the empty function might crash because of the data race, that is a problem; otherwise it is not. Like I said, another thread could still insert a node into the list right after we check emptiness under a lock.

Double-checking is used to solve a race condition, not a data race. A data race is UB and must be avoided. A race condition is not UB and might be acceptable (benign), but it can also break the code, especially reference counting.
To avoid the data race, as Jon said, you should use atomics. You might want to add an atomic counter to avoid the use of the non-atomic List.empty().

When using double-checking, you need to perform all changes under the lock (inserting into the list must be done under the same lock), and all related double-checks must occur under that same lock. In that case, the issue you describe cannot occur.

As a heads up, I'm told this breaks amdgpu tests. @ronlieb is looking at the merge from upstream, don't have any more details at this time. The basic idea of wrapping device alloc seems likely to be sound for all targets so I'd guess we've run into a bug in this patch.

As a heads up, I'm told this breaks amdgpu tests. @ronlieb is looking at the merge from upstream, don't have any more details at this time. The basic idea of wrapping device alloc seems likely to be sound for all targets so I'd guess we've run into a bug in this patch.

If it is a thread-safety issue, adding a mutex in the outward-facing allocate and free should make the code safe while we investigate the root cause.

As a heads up, I'm told this breaks amdgpu tests. @ronlieb is looking at the merge from upstream, don't have any more details at this time. The basic idea of wrapping device alloc seems likely to be sound for all targets so I'd guess we've run into a bug in this patch.

Yeah, filing a bug would be nice because at least I could get a reproducer. ;-) BTW, all the data races mentioned by others were actually guarded by locks.

openmp/libomptarget/src/MemoryManager.cpp
88

This "little issue" of calling into the target plugin after it has been destroyed is a contender for why this patch does not work on amdgpu.

I still think the target plugin, if it wishes to use this allocator, should hold the state itself. That way the allocator can be used internally, e.g. for call frames or the parallel-region malloc, as well as making destruction order straightforward and correct.

protze.joachim added a comment.EditedNov 2 2020, 2:32 PM

The test asserts for x86 offloading:

memory_manager.cpp.tmp-x86_64-pc-linux-gnu: llvm-project/openmp/libomptarget/test/offloading/memory_manager.cpp:37: int main(int, char **): Assertion `buffer[j] == i' failed.
memory_manager.cpp.tmp-x86_64-pc-linux-gnu: llvm-project/openmp/libomptarget/test/offloading/memory_manager.cpp:37: int main(int, char **): Assertion `buffer[j] == i' failed.

The test asserts for x86 offloading:

memory_manager.cpp.tmp-x86_64-pc-linux-gnu: llvm-project/openmp/libomptarget/test/offloading/memory_manager.cpp:37: int main(int, char **): Assertion `buffer[j] == i' failed.
memory_manager.cpp.tmp-x86_64-pc-linux-gnu: llvm-project/openmp/libomptarget/test/offloading/memory_manager.cpp:37: int main(int, char **): Assertion `buffer[j] == i' failed.

Cannot reproduce the failure on my side

I tested this with older clang releases (at least back to clang 9.0) and could reproduce the assertion. The error doesn't seem to be related to this patch; the test just reveals the issue.

I could reduce the issue to:

#include <omp.h>
#include <cassert>
#include <iostream>
#define N 10

int main(int argc, char *argv[]) {
#pragma omp parallel for num_threads(4)
  for (int i = 0; i < 16; ++i) {
    int buffer[N];
    printf("i=%i, n=%i, buffer=%p\n",i,N,buffer);
#pragma omp critical
#pragma omp target teams distribute parallel for              \
    map(from                                                  \
        : buffer)
    for (int j = 0; j < N; ++j) {
      buffer[j] = i;
    }
    for (int j = 0; j < N; ++j) {
      if(buffer[j] != i){
        printf("buffer[j=%i]=%i != i=%i, buffer=%p\n",j,buffer[j],i,buffer);
        assert(buffer[j] == i);
      }
    }
  }
  std::cout << "PASS\n";
  return 0;
}

So I think that the map(from) fails when executed from multiple threads. The issue goes away if the initial test is executed with OMP_NUM_THREADS=1, and adding the critical does not solve it. So I don't think a race in libomptarget is causing the issue.

tcramer added a subscriber: tcramer.Dec 9 2020, 9:49 AM
protze.joachim added inline comments.Dec 10 2020, 8:44 AM
openmp/libomptarget/src/MemoryManager.cpp
88

@tianshilei1992 Any plan to fix this?
This does not only break for AMD, but also for a plugin our group is working on.

Without understanding all the details, I think the destructor of DeviceTy should delete the MemoryManager.
Would this solve the issue? I.e., is DeviceTy destroyed before the target plugin is unloaded?

protze.joachim added inline comments.Dec 10 2020, 8:53 AM
openmp/libomptarget/src/MemoryManager.cpp
88

Nevermind, the unique_ptr should take care of the release. So, why is the device not destroyed before the plugin is unloaded?

JonChesterfield added a comment.EditedDec 10 2020, 10:01 AM

I think I volunteered to fix the global constructor/destructor hazard, then forgot about it.

My intent is to add functions to the plugin:

some_enum __tgt_rtl_plugin_init(void);
some_enum __tgt_rtl_plugin_dtor(void);

with the invariant that plugin_init is the first function called on a given plugin and plugin_dtor is the last. Probably also that init and dtor are called at most once, and that dtor is called exactly once if init is called.

The initialization that currently occurs for global variables in the plugin can then optionally be done in the init call. libomptarget shall destroy the memory manager before calling the dtor, so that it can still make calls into the plugin during destruction.

This doesn't address multiple instances of a given plugin, but also doesn't preclude it. Any plugin that doesn't implement these, won't have them called.

edit: However, I don't think libomptarget knows when a given plugin is no longer in use. There's a TODO in rtl.cpp about removing a RTL if it's not used any more, but I can't see how that can be derived reliably from calls into interface.cpp.

edit2: If we move LoadRTLs out of the first call to RegisterLib and into init() or the PluginManager constructor, then we can move some unloading logic out of UnregisterLib and call that from deinit(), at which point we'll have a good place to put the teardown.
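The proposed (not yet existing) entry points and their ordering invariant could look like the stub below. The return-code enum and signatures are assumptions; only the function names and the invariant come from the comment above.

```cpp
#include <cstdint>

// Hypothetical return codes; the real enum is not defined by the proposal.
enum PluginStatus : int32_t { PLUGIN_SUCCESS = 0, PLUGIN_FAIL = 1 };

// Stub illustrating the proposed invariant: init is the first call into a
// plugin and runs at most once; dtor is the last call and runs exactly once
// iff init ran. Libomptarget would destroy the memory manager between the
// two, so teardown can still call into a live plugin.
static bool PluginLive = false;

extern "C" int32_t __tgt_rtl_plugin_init(void) {
  if (PluginLive)
    return PLUGIN_FAIL; // init must be called at most once
  PluginLive = true;
  return PLUGIN_SUCCESS;
}

extern "C" int32_t __tgt_rtl_plugin_dtor(void) {
  if (!PluginLive)
    return PLUGIN_FAIL; // dtor only after a successful init
  PluginLive = false;   // the memory manager must already be destroyed here
  return PLUGIN_SUCCESS;
}
```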

I'm surprised to find no dlclose matching the dlopen. Instead of calling some function for init/destroy, can't we just use library constructor/destructors in the plugin? All MemoryManagers for a plugin should then be destroyed before the plugin is explicitly dlclosed.
I'm also surprised that LoadRTLs does not dlclose the library in case of missing symbols.

manorom added inline comments.
openmp/libomptarget/src/MemoryManager.cpp
88

Nevermind, the unique_ptr should take care of the release. So, why is the device not destroyed before the plugin is unloaded?

Hope I'm not too late to the party, but:

  • If I tracked this down correctly, plugins don't really get unloaded explicitly, but only when the host program terminates and the program and its libraries are unloaded by the OS.
  • Plugins keep their state in global objects, so their destructors are called when the plugin library is unloaded (at least that's when the VE plugin cleans up its resources, including its target memory).
  • The MemoryManager is (ultimately) owned by the PluginManager, which gets constructed and destroyed explicitly by __attribute__((constructor)) and __attribute__((destructor)) functions in rtl.cpp.

So what I guess happens is that the host program terminates, and then all global destructors are executed, including those in libomptarget and the plugin libraries (before any library actually unloads). The destructor that happens to run first is the one for the plugin library, and the destructor function that deletes the PluginManager gets called later.

This should be disabled on non-CUDA platforms. It is presently a performance improvement on CUDA, might improve or regress performance on other targets, and has a call-method-on-dead-object bug that has been open for months.

In particular, I don't think it helps performance on amdgpu, and it's annoying to have to set an environment variable to suppress a known bug.

I’m going to put the issue on the top of my list.

JonChesterfield added a comment.EditedThu, Jan 7, 6:56 AM

I’m going to put the issue on the top of my list.

Nice! Thank you.

I was thinking of adding an optional function to the plugin API, bool (*enable_memory_manager)(void) or similar, which defaults to returning false if not implemented. It seems the AMD internal branch currently has an #if 0 around the entry point to avoid checking an environment variable, but I'd really like to get rid of that local patch.

The fix is on Phab now. Please refer to D94256 for more details.