
[OpenMP] Introduce target memory manager
Needs Review · Public

Authored by tianshilei1992 on Jun 2 2020, 10:12 PM.

Details

Summary

This patch introduces a target memory manager that manages target memory so
that buffers are not freed immediately once they are no longer in use, because
the overhead of memory allocation and deallocation is very large. On CUDA
devices, cuMemFree even blocks context switching on the device, which hurts
concurrent kernel execution.

The memory manager can be thought of as a memory pool. It divides the pool into
multiple buckets according to allocation size, so that allocations and frees
that land in different buckets do not affect each other.
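
A minimal sketch of such a bucket layout, with hypothetical names and fields rather than the patch's actual data structures:

#include <cstddef>
#include <list>
#include <mutex>
#include <unordered_map>

// One node per device buffer that the pool knows about.
struct NodeTy {
  void *Ptr;   // device pointer returned by the plugin
  size_t Size; // size of the device buffer in bytes
};

// Free buffers of one size class; each bucket has its own lock so traffic in
// different size classes does not serialize.
struct BucketTy {
  std::mutex Lock;
  std::list<NodeTy> FreeList; // buffers returned to the pool, ready for reuse
};

// The pool: a fixed number of power-of-two size classes plus a map from live
// device pointers back to their node, consulted when a buffer is freed.
struct MemoryPoolTy {
  static const size_t NumBuckets = 32;
  BucketTy Buckets[NumBuckets];
  std::mutex MapLock;
  std::unordered_map<void *, NodeTy> InUse;
};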

In this version, we use an exact-equality policy to find a free buffer. This is
an open question: would best-fit work better here? IMO, best-fit is not a good
choice for target memory management because computation on GPUs usually
requires GBs of data, so best-fit can waste a lot of memory. For example,
suppose there is a free buffer of 1960 MB and we now need a buffer of 1200 MB.
With best-fit, the 1960 MB buffer would be returned, wasting 760 MB.

An actual device allocation only happens when there is no suitable free buffer
left, and memory on the device is freed in the following two cases:

  1. The program ends. Obviously. There is a small problem, though: the plugin
library is destroyed before the memory manager is destroyed, so the final calls
into the target plugin will not succeed.

  2. The device is out of memory when we request new memory. The manager then
walks through all free buffers, starting from the bucket with the largest base
size, picks one buffer, frees it, and immediately retries the allocation. If
the retry succeeds, it returns right away rather than freeing every buffer in
the free lists. A sketch of this retry policy is given below.
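
As context for case 2, here is a sketch of the described retry policy, using hypothetical deviceAlloc/deviceFree stand-ins for the plugin's data_alloc/data_delete entry points (simplified, not the patch's actual code):

#include <cstddef>
#include <list>
#include <vector>

// Stand-ins for the plugin's allocation/free entry points.
void *deviceAlloc(size_t Size);
void deviceFree(void *Ptr);

struct NodeTy {
  void *Ptr;
  size_t Size;
};

// Buckets[I] holds the free buffers of the I-th size class; the largest size
// class is last.
using FreeBucketsTy = std::vector<std::list<NodeTy>>;

// Try to allocate Size bytes. On failure, release one held buffer at a time,
// starting from the bucket with the largest base size, and retry immediately
// instead of draining every free list first.
void *allocateWithEviction(FreeBucketsTy &Buckets, size_t Size) {
  if (void *Ptr = deviceAlloc(Size))
    return Ptr;

  for (auto Bucket = Buckets.rbegin(); Bucket != Buckets.rend(); ++Bucket) {
    while (!Bucket->empty()) {
      deviceFree(Bucket->back().Ptr);
      Bucket->pop_back();
      if (void *Ptr = deviceAlloc(Size))
        return Ptr;
    }
  }
  // Still out of memory after releasing everything the pool held.
  return nullptr;
}

Evicting one buffer at a time keeps the rest of the cached buffers alive whenever a single release is already enough to satisfy the request.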

Update:
A threshold has been added so that users can control which allocation sizes are
managed by the manager. A new plugin interface is also added so that a target
device can opt out of the memory manager if its device library already provides
a similar mechanism.
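
A sketch of how such a threshold could gate the manager; the function names and the environment variable below are illustrative, not the patch's actual interface:

#include <cstddef>
#include <cstdlib>

void *deviceAlloc(size_t Size);  // direct plugin allocation (stand-in)
void *managedAlloc(size_t Size); // allocation through the pool (stand-in)

// The threshold could be read once from an environment variable at startup;
// the variable name below is purely hypothetical.
size_t readThreshold() {
  if (const char *Env = std::getenv("OMP_TARGET_MEMORY_MANAGER_THRESHOLD"))
    return std::strtoull(Env, nullptr, 10);
  return 1 << 20; // some default, e.g. 1 MB
}

// Requests larger than Threshold bypass the pool; a threshold of 0 disables
// the manager entirely so every request goes straight to the device.
void *targetAlloc(size_t Size, size_t Threshold) {
  if (Threshold == 0 || Size > Threshold)
    return deviceAlloc(Size);
  return managedAlloc(Size);
}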

Diff Detail

Event Timeline


We definitely want faster memory allocation on the target. This is an interesting piece for that.

This patch implements a memory pool on top of Device->RTL->data_alloc. It's interesting that there's a performance hazard with CUDA there. The HSA layer that amdgpu builds on already has memory pools around kernel allocations, so I'm not sure this would be of direct benefit for amdgpu.

Memory allocators are notoriously difficult to implement efficiently in the face of unknown workload. Can you share benchmarks that lead to this design?

Similarly they're really easy to get wrong. There's a lot of subtle arithmetic in this patch. It would be prudent to cover this with tests, e.g. stub out the RTL->data_alloc calls for malloc/free and run targeted and fuzz tests under valgrind.

A second part of this puzzle is device side memory pools, so that malloc/free from a target region doesn't (always) have to call into the host. That may end up being quite platform dependent. That seems orthogonal to this patch.

openmp/libomptarget/src/memory.cpp
33

This maps:
0 -> 0
1 -> 1
2 -> 2
3 -> 2
4 -> 4
which is not the previous power of two. Rounding down to a power of two could be:
x < 2 ? x : 1 << (31 - __builtin_clz(x - 1))
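
For reference, a tiny standalone helper that maps a value to the largest power of two not exceeding it, using the same __builtin_clz approach (one possible bucketing function, not necessarily what the patch ends up using):

#include <cstdint>
#include <cstdio>

// Largest power of two that is <= X (X must be non-zero).
static uint32_t floorPowerOfTwo(uint32_t X) {
  return 1u << (31 - __builtin_clz(X));
}

int main() {
  // Prints: 1 -> 1, 2 -> 2, 3 -> 2, 4 -> 4, 5 -> 4
  for (uint32_t X = 1; X <= 5; ++X)
    std::printf("%u -> %u\n", X, floorPowerOfTwo(X));
  return 0;
}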

57

Tests for this arithmetic?

72

Why list over smallvector? I can't see a need for iterator stability here

95

This is quadratic - each pass around the loop walks through each node of the list

112

LLVM is built with exceptions disabled, so probably shouldn't raise here

jdoerfert added inline comments.Jun 3 2020, 9:00 AM
openmp/libomptarget/src/memory.cpp
112

This is the runtime, so exceptions "would work". However, no exceptions please. There is no defined interface and no reason to believe the user has a C++ exception handler waiting.

tianshilei1992 added inline comments.Jun 3 2020, 7:20 PM
openmp/libomptarget/src/memory.cpp
33

That is actually my expectation. This function is intended for numbers that are not a power of 2. The comment is not accurate, and I'll update it.
The intention here is to distribute different buffers to different buckets based on their previous power of two. For example, 1024, 1025, 1100, and 2000 will all go to the bucket with base size 1024.

72

Any good suggestions? I also think this style is a little weird, but I cannot find a better one.

95

In the worst case, yes. The worst case is equivalent to releasing all free buffers. That's why this procedure starts from the bucket with the largest size. Each time we release one buffer, we try the allocation once, until it succeeds.

tianshilei1992 added a comment.EditedJun 3 2020, 7:34 PM

Thank you Jon for the review! The comments are really valuable.

Memory allocators are notoriously difficult to implement efficiently in the face of unknown workload. Can you share benchmarks that lead to this design?

The benchmark is quite simple:

#pragma omp parallel for
for (int i = 0; i < 4096; ++i) {
#pragma omp target teams distribute map(...)
  { /* kernel here */ }
}

We have 4096 tasks, and depending on the number of threads N, we have N target regions offloaded at almost the same time, such that M of them might be executing simultaneously. For each kernel, the runtime allocates memory before its execution and frees the memory after the execution. From NVVP, we observed that cuMemFree is very expensive, especially when the computation is light but depends on a large amount of memory. What's more, during cuMemFree there is no context switch on the device, even though multiple kernels are actually executing at the same time. From profiling IBM XL OpenMP, we found that it does not call cuMemFree after each execution, and that is why we are thinking of having a memory pool.

Similarly they're really easy to get wrong. There's a lot of subtle arithmetic in this patch. It would be prudent to cover this with tests, e.g. stub out the RTL->data_alloc calls for malloc/free and run targeted and fuzz tests under valgrind.

That sounds reasonable. Will do it.

A second part of this puzzle is device side memory pools, so that malloc/free from a target region doesn't (always) have to call into the host. That may end up being quite platform dependent. That seems orthogonal to this patch.

That part should be covered in the plugin, which is currently not the focus of this patch. But maybe we could avoid doing memory allocation and free in the plugin, I guess.

ye-luo added a subscriber: ye-luo.EditedJun 18 2020, 3:02 PM

I think this optimization can be an option, but it should not replace the existing scheme of directly allocating/freeing memory.
An application may request device memory outside OpenMP and use a vendor-native programming model or libraries.
Having libomptarget hold on to large amounts of memory doesn't make sense.
You may consider using the pool only for very small allocation requests, e.g. <1 MB.
It is the application's responsibility to take care of large memory allocations.


I agree that this optimization could be optional, perhaps enabled/disabled by an environment variable, so that power users can still take care of everything on their own. However, not every user is a power user who uses OpenMP offloading in an expert way, allocating memory with device RTL functions and using it directly; most will still use the interfaces provided by OpenMP to allocate device memory.

Having an allocation size limit may work. Below MAXSIZE, go to the manager; above MAXSIZE, go directly to the device.
Power users can even set MAXSIZE=0 to fully skip the manager if they want.


I like that idea.
By default I want some memory management for people, and the smaller the allocations, the more important it is. That said, opting out should always be an option.

@tianshilei1992 please also make sure that the runtime shutdown routines will free this memory.

I happened to find that the huge overhead of cuMemFree might be due to the fact that it is called while the data is still being used. I will come back to this patch after I fix that issue and re-evaluate whether we still need this.

cdaley added a subscriber: cdaley.Wed, Jul 22, 4:09 PM

I left a bunch of comments below, from minor nits and style things to design suggestions. At a high level, I think there are a few things:

  • We want to manage device memory for allocations up to a user-defined maximum size. That means we need compile-time parameters here, and maybe we should also read environment variables at creation time.
  • We should consider "more complex" designs that allow trading off wasted memory against the number of allocations. For example, we can allocate an array with N elements of size S if we run out of S-sized nodes; see the sketch after this list. That further minimizes the number of runtime calls through the entire system. Let's leave that for later though.
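
A rough sketch of that slab idea, assuming a per-size-class free list and a hypothetical deviceAlloc stand-in (not part of this patch):

#include <cstddef>
#include <vector>

void *deviceAlloc(size_t Size); // plugin allocation (stand-in)

// When a size class runs dry, grab one slab of N * S bytes and carve it into
// N entries instead of issuing N separate runtime calls. Note that the slab
// must eventually be freed as a whole, so the pool has to remember it.
void refillSizeClass(std::vector<void *> &FreeList, size_t S, size_t N) {
  char *Slab = static_cast<char *>(deviceAlloc(S * N));
  if (!Slab)
    return; // out of device memory; the caller falls back to a direct call
  for (size_t I = 0; I < N; ++I)
    FreeList.push_back(Slab + I * S);
}
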
openmp/libomptarget/src/device.cpp
211

Nit: make tp a void * and cast the one use of it as uintptr_t instead.

304

Nit: Remove the cast.

openmp/libomptarget/src/memory.cpp
62

No inline but static please. Same above. This feels like something we could use from LLVM or share with libomp... this code duplication is a nightmare.

Anyway, as @JonChesterfield mentioned, we should aim for a unit test here. We could also add an executable test case that hits a bucket really hard to ensure we can deal with it.

69

Function and member comments in doxygen style please. Also in the header.


Add a static assert that the node size is 2 * sizeof(void*).
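
For instance, assuming the node holds just a device pointer and a size, such an assert could look like this (illustrative only):

#include <cstddef>

struct NodeTy {
  void *Ptr;        // device pointer
  std::size_t Size; // buffer size in bytes
};

// On the usual targets size_t is pointer-sized, so the node occupies exactly
// two pointers; this catches accidental growth of the struct.
static_assert(sizeof(NodeTy) == 2 * sizeof(void *),
              "NodeTy is expected to be as small as two pointers");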

71

Comment explaining this thing. I'm also very confused by the duplicated class declaration. Let's not do that.

78

Comments explaining these things. Maybe place the mutexes next to the things they protect.

Why a std::list and an unordered map?

Naturally, I would have gone with a vector or std::deque. To "delete" elements I would mark them taken. There should be 32-bit padding in a Node anyway. Though it is hard to predict what is good.

I am unsure about the map; without measurements it's guesswork, and I would go with the regular one, but this is fine.
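
A sketch of the "mark them taken" alternative (illustrative only, not a concrete proposal for the final code):

#include <cstdint>
#include <vector>

// Instead of erasing nodes from a std::list, keep them in a vector and flip a
// flag; a 32-bit flag fits into the padding a {pointer, 32-bit size} node
// would have on 64-bit targets anyway.
struct NodeTy {
  void *Ptr;
  uint32_t Size;
  uint32_t Taken; // 0 = free, 1 = handed out
};

// Reuse a "deleted" slot if one exists, otherwise append a new node.
NodeTy *takeNode(std::vector<NodeTy> &Nodes, void *Ptr, uint32_t Size) {
  for (NodeTy &N : Nodes)
    if (!N.Taken) {
      N = {Ptr, Size, 1};
      return &N;
    }
  Nodes.push_back({Ptr, Size, 1});
  return &Nodes.back();
}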

86

Comments on all of these please. Maybe allocateOnDevice as a name instead?

92

Hm.. running out of memory seems like an "edge case", and if it happens it seems "likely" to happen again. Why not use the opportunity to free everything in the free list while we are here? I mean, it will be "cheaper" complexity-wise, reasonably useful given that the next allocation will hit the same problem, and very much simpler.

151

If this is part of the device, the place we tear down the context, this issue should go away, I think.

154

At least for malloc and friends that is totally fine btw. If we filter this earlier we can leave the assert though.

159

Descriptive variable names are worth the trouble of typing.

173

We should round the size up to increase reuse. Also makes all blocks in a bucket the same size.

183

Nothing is wrong; we just ran out of memory. Return a nullptr and all is good.

198

As the lock is released, this can/should go into a helper function.

openmp/libomptarget/src/memory.h
22

If there is no private state I'd go for struct. Though I would have expected private state TBH.

36

I think the MemoryManager, like the StreamManager, is a thing that belongs to a Device. Different devices might choose different implementations, etc. That also reduces our global state footprint. Note that you can and should keep the memory.{h,cpp} files, but make the object part of a Device if possible.
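
A sketch of that ownership structure; DeviceTy and MemoryManagerTy here are only stand-ins for the real classes:

#include <memory>

// Stand-in for the real manager class declared in memory.h.
class MemoryManagerTy { /* ... */ };

// Hypothetical excerpt of the Device class: each device owns its own manager,
// so different devices can pick different implementations and there is no
// process-global pool.
struct DeviceTy {
  std::unique_ptr<MemoryManagerTy> MemoryManager;
  // ... other per-device state, e.g. the stream manager ...
};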

Refactored the whole patch. Corresponding tests will be added soon.

tianshilei1992 marked 18 inline comments as done.Sat, Aug 1, 7:09 PM
tianshilei1992 added inline comments.
openmp/libomptarget/src/device.cpp
211

Unrelated to this patch, so marking it as Done.

304

Unrelated to this patch, so marking it as Done.

openmp/libomptarget/src/memory.cpp
71

It's not a duplicate. I just don't want to put too many things into a header that will be included by others. Using PImpl could make things better.

151

The comment seems out of date.

154

The comment seems out of date.

173

The comment seems out of date.

198

The comment seems out of date.

tianshilei1992 marked 7 inline comments as done.Sat, Aug 1, 7:10 PM
tianshilei1992 edited the summary of this revision. (Show Details)Sat, Aug 1, 7:12 PM

Updated the function name in exports

tianshilei1992 added inline comments.Sat, Aug 1, 8:53 PM
openmp/libomptarget/src/memory.cpp
236

This line should be removed.

Removed a wrong line of code

Fixed a minor issue

Took type alias into the class

Added debug print

Rebased before moving to allocator

jdoerfert added inline comments.Mon, Aug 3, 3:47 PM
openmp/libomptarget/src/device.h
32

Can we call these things MemoryManagerInterface and MemoryManagerImpl instead?

openmp/libomptarget/src/memory.cpp
11

Can you add a description of the algorithm here, please? What is happening and why.

36
39
67

-inline +static

openmp/libomptarget/src/memory.h
32

Describe what Threshold does (in some detail)

tianshilei1992 added inline comments.Mon, Aug 3, 6:02 PM
openmp/libomptarget/src/memory.cpp
67

I didn't get that. Why does inline not work here? This function is so simple that I would like to see it inlined by the compiler.

Updated based on comments

tianshilei1992 marked 3 inline comments as done.Mon, Aug 3, 6:44 PM
tianshilei1992 added inline comments.
openmp/libomptarget/src/device.h
32

I renamed the implementation class to MemoryManagerImplTy.

tianshilei1992 marked an inline comment as done.Mon, Aug 3, 6:44 PM
jdoerfert added inline comments.Tue, Aug 4, 11:40 PM
openmp/libomptarget/src/memory.cpp
67

inline is two things: a "hint" which affects the inliner heuristic, and a way to get linkonce_odr linkage for functions. It is not a way to force inlining; that is __attribute__((always_inline)). That said, there is no need to tell the inliner what to do anyway, but we should always limit the lifetime of things, so make them static if possible. Take a look at https://godbolt.org/z/Mjnhe8 to see the effects the different annotations have.
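
A small illustration of the three annotations (mirroring the godbolt link above rather than code from the patch; compile it and inspect the emitted symbols to see the difference):

// inline: an optimizer hint plus linkonce_odr-style linkage; the symbol can
// still be emitted and referenced from other translation units.
inline int timesTwoInline(int X) { return X * 2; }

// static: internal linkage, so the definition is local to this translation
// unit and the optimizer is free to inline it or drop it entirely.
static int timesTwoStatic(int X) { return X * 2; }

// always_inline: actually forces inlining at every call site.
__attribute__((always_inline)) inline int timesTwoForced(int X) {
  return X * 2;
}

int user(int X) {
  return timesTwoInline(X) + timesTwoStatic(X) + timesTwoForced(X);
}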

Updated according to comments

tianshilei1992 marked 6 inline comments as done.Wed, Aug 5, 12:32 PM

Updated the calculation of NumBuckets

tianshilei1992 marked an inline comment as done.Wed, Aug 5, 12:39 PM

Use const_iterator

Make mutexes close to the variables they protect