Page MenuHomePhabricator

[OpenMP][NVPTX] Replace void** buffer by byte-wise buffer
Needs ReviewPublic

Authored by jdoerfert on Mar 15 2019, 11:35 AM.

Details

Summary

This commit implements the existing void buffer used to share
arguments between threads in a team with a byte-wise buffer. For now,
the void
buffer is kept for compatibility.

The byte-wise buffer, if used directly, allows to save memory when small
arguments are shared between team threads. It does also allow to track
an additional offset that differentiates two distinct back-to-back
memory regions, e.g., for shared (copy in & out) and firstprivate (copy
in only) variables.

This is a preparation patch for https://reviews.llvm.org/D59319

Event Timeline

jdoerfert created this revision.Mar 15 2019, 11:35 AM
jdoerfert marked an inline comment as done.

Fix the set/release use case

jdoerfert marked an inline comment as not done.Mar 15 2019, 12:06 PM
jdoerfert added inline comments.
openmp/libomptarget/deviceRTLs/nvptx/src/omptarget-nvptx.h
73

What is this buffer used for? Transferring pointers to the shread variables to the parallel regions? If so, it must be handled by the compiler. There are several reasons to do this:

  1. You're using malloc/free functions for large buffers. The fact is that the size of this buffer is known at the compile time and compiler can generate the fixed size buffer in the global memory if required. We already have similar implementation for target regions, globalized variables etc. You can take a look and adapt it for your purpose.
  2. Malloc/free are not very fast on the GPU, so it will get an additional performance with the preallocated buffers.
  3. Another one problem with malloc/free is that they are using preallocated memory and the size of this memory is limited by 8Mb (if I do recall correctly). This memory is required for the correct support of the local variables globalization and we alredy ran into the situation when malloc could not allocate enough memory for it with some previous implementations.
  4. You can reused the shared memory buffers already generated by the compiler and save shared memory.

[Quote by ABataev copied from https://reviews.llvm.org/D59319?id=190767#inline-525900 after the patch was split.]

This buffer is supposed to be used to communicate variables in shared and firstprivate clauses between threads in a team. In this patch it is simply used to implement the old void** buffer. How, when, if we use it is part of the interface implementation. For now, this buffer simply serves the users of the omptarget_nvptx_globalArgs global.

If you want to provide compiler allocated memory to avoid the buffer use, no problem,
the __kmpc_target_region_kernel_parallel function allows to do so, see the SharedMemPointers flag. I wouldn't want to put the logic to generate these buffers in the front-end though.

ABataev added inline comments.Mar 15 2019, 12:08 PM
openmp/libomptarget/deviceRTLs/nvptx/src/omptarget-nvptx.h
73

Why?

jdoerfert marked an inline comment as done.Mar 15 2019, 12:21 PM
jdoerfert added inline comments.
openmp/libomptarget/deviceRTLs/nvptx/src/omptarget-nvptx.h
73

Why what?

ABataev added inline comments.Mar 15 2019, 12:29 PM
openmp/libomptarget/deviceRTLs/nvptx/src/omptarget-nvptx.h
73

Why you don't want to put the buffers generation to the compiler?

jdoerfert marked an inline comment as done.Mar 15 2019, 12:37 PM
jdoerfert added inline comments.
openmp/libomptarget/deviceRTLs/nvptx/src/omptarget-nvptx.h
73

Why you don't want to put the buffers generation to the compiler?

I did never say that. I explicitly explained how the new interface allows you to do exactly that. Maybe you confuse it with me not wanting the generation to be part of the front-end (=Clang). That is because it is an implementation choice for performance, as such it should not be done by Clang if it might obstruct analyses later on, or could be done better with more analysis support.

If we conclude we actually need to share values, after we tried to eliminate the need for sharing altogether, we can decide if a global buffer is preferable. If so, we can check if
there is already one that is unused or if we would need to create a new one.

Regarding this patch: It is not supposed to change the behavior at all. This patch just introduces a more general buffer which is then used to implement the old buffer. How/when the buffer is used is not affected.

ABataev added inline comments.Mar 15 2019, 12:46 PM
openmp/libomptarget/deviceRTLs/nvptx/src/omptarget-nvptx.h
73

Only Clang can do better analysis for the shared variables, runtime is not. This part must be implemented in Clang, not in the runtime. I had no time to fix the existing implementation of parameters passing in non-SPMD mode. But if you working on this, you should definetely implement this in the compiler, not in the library.
It does not brea the analysis or anything else. It really produces better code with the better performance and less memory use.
If you're going to implement it, you need to implement it in the best possible way. Who else is going to fix this later, when we ran into the problems with the shared memory and/or malloced memory?

jdoerfert marked an inline comment as done.Mar 15 2019, 1:01 PM
jdoerfert added inline comments.
openmp/libomptarget/deviceRTLs/nvptx/src/omptarget-nvptx.h
73

Only Clang can do better analysis for the shared variables, runtime is not. This part must be implemented in Clang, not in the runtime. I had no time to fix the existing implementation of parameters passing in non-SPMD mode. But if you working on this, you should definetely implement this in the compiler, not in the library.

Clang is not the right place for analysis. That is what the patch set is all about.

I do not propose to only use the library implementation to handle sharing.

The initially generated code will utilize a library based solution that potentially calls malloc and memcpy. However, the later LLVM passes can change the behavior by providing a buffer that is accessible by all threads in the team, eliminating the need for malloc and memcpy.
This is a *generalization* of the scheme you currently employ, eliminating the need for mallocs completely (which is not the case right now).

It does not brea the analysis or anything else. It really produces better code with the better performance and less memory use.

Adding (global) state makes analysis results rarely better but almost always worse. If you do not believe me, please feel free to ask a compiler developer that you trust.

If you're going to implement it, you need to implement it in the best possible way.

I totally agree.

Who else is going to fix this later, when we ran into the problems with the shared memory and/or malloced memory?

Please read my comments again. I explain multiple times that:

A) This patch has nothing to do with your problem.
B) How the proposed TRegion interface tackles your problem.

Could you check if there is any change in the number of registers new scheme vs. old scheme?