
[OpenMP] Introduce target memory manager
Needs Review · Public

Authored by tianshilei1992 on Jun 2 2020, 10:12 PM.

Details

Summary

This patch introduces a target memory manager that aims to avoid freeing target
memory immediately when it is no longer used, because the overhead of device
memory allocation and deallocation is very high. On CUDA devices, cuMemFree
even blocks context switching on the device, which hurts concurrent kernel
execution.

The memory manager can be viewed as a memory pool. It divides the pool into
multiple buckets according to allocation size, so that allocations and frees
that land in different buckets do not affect each other.

In this version, we use an exact-match policy to find a free buffer. Whether
best-fit would work better here is an open question. IMO, best-fit is not a
good choice for target memory management because GPU computations usually
operate on GBs of data, so best-fit can waste a serious amount of memory. For
example, if there is a free buffer of 1960 MB and we need a buffer of 1200 MB,
best-fit would return that free buffer, wasting 760 MB.
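For concreteness, here is a minimal sketch of the bucket layout and the exact-match lookup described above. The class, the names, and the plugin stand-ins are illustrative only, not the interface used in the patch:

#include <cstddef>
#include <list>
#include <mutex>

// Illustrative stand-ins for the plugin entry points (not the real RTL API).
void *pluginDataAlloc(size_t Size);
void pluginDataDelete(void *Ptr);

// Buckets are indexed by the power-of-two "base size" of a request; within a
// bucket, a cached buffer is reused only if its size matches exactly.
class TargetMemoryPoolSketch {
  struct FreeBuffer {
    void *Ptr;
    size_t Size;
  };
  static constexpr int NumBuckets = 64;
  std::list<FreeBuffer> Buckets[NumBuckets];
  std::mutex Mtx;

  static int bucketIndex(size_t Size) {
    // floor(log2(Size)); GCC/Clang builtin assumed.
    return Size <= 1 ? 0 : 63 - __builtin_clzll(Size);
  }

public:
  void *allocate(size_t Size) {
    {
      std::lock_guard<std::mutex> G(Mtx);
      auto &Bucket = Buckets[bucketIndex(Size)];
      for (auto It = Bucket.begin(); It != Bucket.end(); ++It) {
        if (It->Size == Size) { // exact-match policy
          void *Ptr = It->Ptr;
          Bucket.erase(It);
          return Ptr;
        }
      }
    }
    // No cached buffer of exactly this size: fall back to the plugin.
    return pluginDataAlloc(Size);
  }

  void deallocate(void *Ptr, size_t Size) {
    // Cache the buffer for reuse instead of freeing it on the device.
    std::lock_guard<std::mutex> G(Mtx);
    Buckets[bucketIndex(Size)].push_back({Ptr, Size});
  }

  // Out-of-memory path; defined in a later sketch.
  void *allocateWithEviction(size_t Size);
};

Keeping one list per power-of-two base size means allocations of very different magnitudes never scan each other's free buffers.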

A device allocation only happens when no suitable free buffer is left in the
pool, and device memory is freed in the following two cases:

  1. The program ends. Obviously. However, there is a small problem: the plugin
library is destroyed before the memory manager, so at that point the calls into
the target plugin will no longer succeed.

  2. The device is out of memory when a new allocation is requested. The manager
walks through the free buffers, starting from the bucket with the largest base
size, picks one buffer, frees it, and immediately retries the allocation. If
the retry succeeds, it returns right away rather than freeing every buffer in
the free lists (see the sketch below).
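A minimal sketch of that out-of-memory path, continuing the illustrative pool class from the summary sketch (again, not the patch's actual code):

#include <mutex>

// Sketch of case 2: free cached buffers one at a time, starting from the
// bucket with the largest base size, and retry the allocation after each
// free instead of draining every free list up front.
void *TargetMemoryPoolSketch::allocateWithEviction(size_t Size) {
  // First try the normal path (cached buffer or a fresh plugin allocation).
  if (void *Ptr = allocate(Size))
    return Ptr;

  std::lock_guard<std::mutex> G(Mtx);
  for (int I = NumBuckets - 1; I >= 0; --I) {
    auto &Bucket = Buckets[I];
    while (!Bucket.empty()) {
      // Release one cached buffer back to the device...
      pluginDataDelete(Bucket.back().Ptr);
      Bucket.pop_back();
      // ...and immediately retry rather than freeing everything.
      if (void *Ptr = pluginDataAlloc(Size))
        return Ptr;
    }
  }
  return nullptr; // The device is genuinely out of memory.
}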

Diff Detail

Event Timeline

tianshilei1992 created this revision. Jun 2 2020, 10:12 PM

Updated function names to conform to the LLVM coding standards.

We definitely want faster memory allocation on the target. This is an interesting piece for that.

This patch implements a memory pool on top of Device->RTL->data_alloc. It's interesting that there's a performance hazard with CUDA there. The HSA layer that amdgpu builds on has memory pools around the kernel allocation, so I'm not sure this would be of direct benefit for amdgpu.

Memory allocators are notoriously difficult to implement efficiently in the face of unknown workload. Can you share benchmarks that lead to this design?

Similarly they're really easy to get wrong. There's a lot of subtle arithmetic in this patch. It would be prudent to cover this with tests, e.g. stub out the RTL->data_alloc calls for malloc/free and run targeted and fuzz tests under valgrind.
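For illustration, a stubbed-out stress test along those lines could look roughly like the following, reusing the illustrative pool class sketched in the summary; the real test would hook the RTL's data_alloc/data_delete entry points rather than these stand-ins:

#include <cassert>
#include <cstdlib>
#include <random>
#include <utility>
#include <vector>

// Stub the device allocation entry points with host malloc/free so the
// pool's bookkeeping can be exercised under valgrind or a sanitizer.
void *pluginDataAlloc(size_t Size) { return std::malloc(Size); }
void pluginDataDelete(void *Ptr) { std::free(Ptr); }

int main() {
  TargetMemoryPoolSketch Pool;
  std::mt19937 Rng(0);
  std::uniform_int_distribution<size_t> SizeDist(1, 1 << 20);

  // Random allocate/deallocate pattern: any double free, leak, or
  // out-of-bounds access in the pool shows up in the memory checker.
  std::vector<std::pair<void *, size_t>> Live;
  for (int I = 0; I < 10000; ++I) {
    if (Live.empty() || Rng() % 2) {
      size_t Size = SizeDist(Rng);
      void *Ptr = Pool.allocate(Size);
      assert(Ptr && "host allocation should not fail here");
      Live.emplace_back(Ptr, Size);
    } else {
      size_t Idx = Rng() % Live.size();
      Pool.deallocate(Live[Idx].first, Live[Idx].second);
      Live[Idx] = Live.back();
      Live.pop_back();
    }
  }
  for (auto &P : Live)
    Pool.deallocate(P.first, P.second);
  // Note: buffers still cached in the pool are reported at exit unless the
  // pool's destructor returns them via pluginDataDelete.
  return 0;
}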

A second part of this puzzle is device side memory pools, so that malloc/free from a target region doesn't (always) have to call into the host. That may end up being quite platform dependent. That seems orthogonal to this patch.

openmp/libomptarget/src/memory.cpp
32

This maps:
0 -> 0
1 -> 1
2 -> 2
3 -> 2
4 -> 4
which is not the previous power of two. Rounding down to a power of two could be:
x < 2 ? x : 1 << (31 - __builtin_clz(x - 1))

56

Tests for this arithmetic?

71

Why std::list over SmallVector? I can't see a need for iterator stability here

94

This is quadratic - each pass around the loop walks through each node of the list

111

LLVM is built with exceptions disabled, so this probably shouldn't throw here

jdoerfert added inline comments. Jun 3 2020, 9:00 AM
openmp/libomptarget/src/memory.cpp
111

This is the runtime, so exceptions "would work". However, no exceptions please. There is no defined interface and no reason to believe the user has a C++ exception handler waiting.

tianshilei1992 added inline comments. Jun 3 2020, 7:20 PM
openmp/libomptarget/src/memory.cpp
32

That is actually my expectation. This function is meant for numbers that are not powers of 2; the comment is not accurate, and I'll update it.
The intention here is to distribute buffers to different buckets based on their previous power of two. For example, 1024, 1025, 1100, and 2000 will all go to the bucket of size 1024.
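For illustration, the intended mapping could be written and sanity-checked as follows (a sketch, not the function from the patch):

#include <cassert>
#include <cstdint>

// Round a 32-bit size down to the nearest power of two; exact powers of two
// map to themselves (GCC/Clang builtin assumed).
uint32_t floorPowerOfTwo(uint32_t X) {
  return X <= 1 ? X : 1u << (31 - __builtin_clz(X));
}

int main() {
  // All of these land in the 1024 bucket, as described above.
  assert(floorPowerOfTwo(1024) == 1024);
  assert(floorPowerOfTwo(1025) == 1024);
  assert(floorPowerOfTwo(1100) == 1024);
  assert(floorPowerOfTwo(2000) == 1024);
  return 0;
}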

71

Any good suggestions? I also think this style is a little weird, but I cannot find a better one.

94

In the worst case, yes. The worst case is equivalent to releasing all free buffers. That's why this procedure starts from the bucket with the largest size: each time we release one buffer, we try the allocation once, and we stop as soon as the allocation succeeds.

tianshilei1992 added a comment (edited). Jun 3 2020, 7:34 PM

Thank you, Jon, for the review! The comments are really valuable.

Memory allocators are notoriously difficult to implement efficiently in the face of unknown workload. Can you share benchmarks that lead to this design?

The benchmark is quite simple:

#pragma omp parallel for
for (int i = 0; i < 4096; ++i) {
#pragma omp target teams distribute map(...)
  for (int j = 0; j < /* ... */; ++j) {
    /* kernel body here */
  }
}

We have 4096 tasks, and depending on the number of threads N, we have N target regions offloaded almost at the same time, M of which might be executing simultaneously. For each kernel, the runtime allocates device memory before its execution and frees the memory afterwards. From NVVP, we observed that cuMemFree is very expensive, especially when the computation is light but depends on a large amount of memory. What's more, during cuMemFree there is no context switch on the device, even though multiple kernels are actually executing at the same time. From profiling IBM XL OpenMP, we found that it does not call cuMemFree after each execution, and that is why we are thinking of having a memory pool.

Similarly they're really easy to get wrong. There's a lot of subtle arithmetic in this patch. It would be prudent to cover this with tests, e.g. stub out the RTL->data_alloc calls for malloc/free and run targeted and fuzz tests under valgrind.

That sounds reasonable. Will do it.

A second part of this puzzle is device side memory pools, so that malloc/free from a target region doesn't (always) have to call into the host. That may end up being quite platform dependent. That seems orthogonal to this patch.

That part should be covered in the plugin, which is currently not the focus of this patch. But maybe we could also avoid doing memory allocation and free in the plugin, I guess.

ye-luo added a subscriber: ye-luo. Edited Thu, Jun 18, 3:02 PM

I think this optimization can be an option, but it should not replace the existing scheme of directly allocating/freeing memory.
An application may request device memory outside OpenMP, using a vendor's native programming model or libraries.
Having libomptarget hold large amounts of memory doesn't make sense.
You may consider using the pool only for very small allocation requests, <1 MB.
It is the application's responsibility to take care of large memory allocations.

I think this optimization can be an option, but it should not replace the existing scheme of directly allocating/freeing memory.
An application may request device memory outside OpenMP, using a vendor's native programming model or libraries.
Having libomptarget hold large amounts of memory doesn't make sense.
You may consider using the pool only for very small allocation requests, <1 MB.
It is the application's responsibility to take care of large memory allocations.

I agree with you that this optimization could be optional, perhaps enabled/disabled by an environment variable, such that power users can still take care of everything on their own. But I don't think every user is a power user who would use OpenMP offloading in an expert's way, allocating memory with device RTL functions and then using it directly. They might still use the interfaces provided by OpenMP to allocate device memory.

Having an allocation size limit may work. Below MAXSIZE, go to the manager; above MAXSIZE, go directly to the plugin.
Power users can even set MAXSIZE=0 to fully skip the manager if they want.
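A minimal sketch of that threshold idea, reusing the illustrative pool from the summary sketch; the environment variable name and the 1 MB default are made up for illustration and are not part of this patch:

#include <cstdlib>

// Hypothetical MAXSIZE knob: requests at or below the limit go through the
// memory manager, larger ones (or all of them, if MAXSIZE == 0) bypass it.
static size_t readMaxSize() {
  // Environment variable name and 1 MB default are illustrative only.
  if (const char *Env = std::getenv("LIBOMPTARGET_MEMORY_MANAGER_MAXSIZE"))
    return std::strtoull(Env, nullptr, 10);
  return 1 << 20;
}

void *deviceAlloc(TargetMemoryPoolSketch &Pool, size_t Size) {
  static const size_t MaxSize = readMaxSize();
  if (MaxSize == 0 || Size > MaxSize)
    return pluginDataAlloc(Size); // skip the manager entirely
  return Pool.allocate(Size);
}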

Having an allocation size limit may work. Below MAXSIZE, go to the manager; above MAXSIZE, go directly to the plugin.
Power users can even set MAXSIZE=0 to fully skip the manager if they want.

I like that idea.
By default I want some memory management for people, and the smaller the allocations, the more important it is. That said, opting out should always be an option.

@tianshilei1992 please also make sure that the runtime shutdown routines will free this memory.

I happened to find that the huge overhead of cuMemFree might be due to the fact that it is called while the data is still being used. I'll come back to this patch after I fix that issue and re-evaluate whether we still need this.