Page MenuHomePhabricator

[OpenMP][Offloading][1/3] A generic and simple target region interface
Needs ReviewPublic

Authored by jdoerfert on Mar 13 2019, 1:14 PM.

Details

Summary

This patch introduces an alternative OpenMP GPU kernel offloading
interface called target kernel region (or TRegion).

The commit includes the runtime library implementation for the NVPTX
device plugin, implemented mostly in terms of the existing
functionality.

The interface is deliberately simple to be easily analyzable in the
middle end. Design decisions included:

  • Hide all (complex) implementation choices in the runtime library but allow complete removal of the abstraction once the runtime is inlined.
  • Provide all runtime calls with sufficient, easy encoded information.

Event Timeline

jdoerfert created this revision.Mar 13 2019, 1:14 PM
Herald added a project: Restricted Project. · View Herald TranscriptMar 13 2019, 1:14 PM
jdoerfert updated this revision to Diff 190484.Mar 13 2019, 1:17 PM

Simplify the commmit further

ABataev added inline comments.Mar 13 2019, 1:40 PM
openmp/libomptarget/cmake/Modules/LibomptargetNVPTXBitcodeLibrary.cmake
81 ↗(On Diff #190483)

It must be in a separate patch

88 ↗(On Diff #190483)

Same, if you really need it - separate patch

105 ↗(On Diff #190483)

Same here

openmp/libomptarget/deviceRTLs/common/target_region.h
28

All exported functions are declared in the interface.h file. I don't think we need an extra interface file here

101

Better to use ident_loc for passing info about execution mode and full/lightweight runtime.

jdoerfert marked 5 inline comments as done.Mar 13 2019, 2:35 PM
jdoerfert added inline comments.
openmp/libomptarget/deviceRTLs/common/target_region.h
28

interface.h, or to be more precise for people that do not know, deviceRTLs/nvptx/src/interface.h, is nvptx specific. This file, deviceRTLs/common/target_region.h, is by design target agnostic and not placed _under_ the nvptx subfolder. If you are willing to move interface.h into a common space and remove the nvptx specific functions we can merge the two. Otherwise, I have strong reservations agains that and good reason not to do it.

101

Could you please explain why you think that? Adding indirection through a structure does not really seem beneficial to me.

ABataev added inline comments.Mar 14 2019, 8:29 AM
openmp/libomptarget/deviceRTLs/common/target_region.h
28

I see that currently it is written in Cuda. It means, it targets NVidia GPUs, at least at the moment. I'm fine to put this header file into the common directory, if you're sure that this is really target agnostic. But maybe just for a start we should put it to NVPTX directory? Later, when you or somebody else will add support for other GPUs and he/she will find out that these functions are really target agnostic, they can be moved into the common directory?

101

Almost all function from libomp rely on ident_loc. The functions, which were added for NVPTX without this parameter had a lot of problems later and most of them were replaced with the functions with this parameter type. Plus, this parameter is used for OMPD/OMPT and it may be important for future OMPD/OMPT support.

125

We used void * for buffers usually, I think it is better to use void * here too instead of char *.

openmp/libomptarget/deviceRTLs/nvptx/src/omp_data.cu
70 ↗(On Diff #190484)

It would be good to store it the global memory rather than in the shared to save th shared memory. Also, we already are using several shared memory buffers for different purposes, it would be good to merge them somehow to reduce pressure on shared memory.

openmp/libomptarget/deviceRTLs/nvptx/src/target_region.cu
65

What is the criteria for UseStateMachine? Under what conditions it can be set to true and false? Also, what if have several parallel regions in non-SPMD kernel and UseStateMachine is true?

jdoerfert marked 6 inline comments as done.

Change char* to void*

jdoerfert added inline comments.Mar 14 2019, 12:49 PM
openmp/libomptarget/deviceRTLs/common/target_region.h
28

I see that currently it is written in Cuda. It means, it targets NVidia GPUs, at least at the moment

How do you see that? (I hope we both talk about this file, correct?)

But maybe just for a start we should put it to NVPTX directory?

Why? What is the benefit? If we want it to be agnostic, regardless of the current state, it should be developed _outside_ of the target specific directories.

101

Almost all function from libomp rely on ident_loc.

If you look at the implementation of this interface for NVPTX you will see that the called functions do not take ident_loc values. When you create the calls from the existing NVPTX code generation in clang, the current code does not use ident_loc for similar functions, see:
___kmpc_kernel_init(kmp_int32 thread_limit, int16_t RequiresOMPRuntime),
__kmpc_kernel_deinit(int16_t IsOMPRuntimeInitialized),
__kmpc_spmd_kernel_init(kmp_int32 thread_limit, int16_t RequiresOMPRuntime, int16_t RequiresDataSharing),
__kmpc_kernel_parallel(void **outlined_function, int16_t IsOMPRuntimeInitialized),
...

Plus, this parameter is used for OMPD/OMPT and it may be important for future OMPD/OMPT support.

If we at some point need to make the options permanent in an ident_loc we can simply pass an ident_loc and require it to be initialized by the call. Cluttering the user code with stores and indirection is exactly what I do want to avoid.

125

Thanks, fixed.

openmp/libomptarget/deviceRTLs/nvptx/src/omp_data.cu
70 ↗(On Diff #190484)

I would have reused your buffer but it is for reasons unclear to me, not a byte-wise buffer but an array of void * and also used as such. Using it as a byte-wise buffer might cause problems or at least confusion. Changing it to a byte-wise buffer would be fine with me. I don't need a separate buffer but just one with the functionality implemented in this one.

openmp/libomptarget/deviceRTLs/nvptx/src/target_region.cu
65

What is the criteria for UseStateMachine? Under what conditions it can be set to true and false?

UseStateMachine is an option exposed to the outer world through the __kmpc_target_region_kernel_init call. The semantics are explained here and in the declaration of __kmpc_target_region_kernel_init.

Also, what if have several parallel regions in non-SPMD kernel and UseStateMachine is true?

I don't see the problem, I expect all kernels having threads in their own state machine and no interference between them. That is at least what should happen. Maybe I miss something. Do you see a problem?

ABataev added inline comments.Mar 14 2019, 1:06 PM
openmp/libomptarget/deviceRTLs/common/target_region.h
28

I'm not talking about this particular file, just like I said we can put it into common subdirectory. I'm talking about the implementation files. They all are written in Cuda, no?
But it is not proved yet that this solution is target agnostic. Did you test it for AMD?

101
  1. The new functions rely on ident_loc. We had to add those new functions because the old ones did not use it and it was bad design decision. Now we need to fix this. I suggest you do everything right from the very beginning rather than fixing this later by adding extra entry points to support OMPT/OMPD or something else, for example.
  2. No, you cannot simply change the interface of the library to keep the compatibility with the previous versions of the compiler/library. You will need to add the new entries.
openmp/libomptarget/deviceRTLs/nvptx/src/omp_data.cu
70 ↗(On Diff #190484)

I don't know what my buffer are talking about. I'm just saying that we already using a lot of shared memory and adding another one shared memory buffer of ~150 bytes per team increases pressure on the shared memory. It would be good to reuse the existing buffers somehow. It was just a suggestion.

openmp/libomptarget/deviceRTLs/nvptx/src/target_region.cu
65
  1. I see its semantics, I'm asking when it must be set to true and when to false. Maybe I missed something, but currently, it is always set to true in the compiler patch. Do you really need it?
  2. What if you have a single kernel with several consecutive parallel regions? Can you handle this?
167

Use void * also, better to keep the same coding style across the whole library

jdoerfert updated this revision to Diff 190717.Mar 14 2019, 1:46 PM
jdoerfert marked 4 inline comments as done.

Replace more char* with void*

jdoerfert marked 2 inline comments as done.Mar 14 2019, 1:46 PM
jdoerfert added inline comments.
openmp/libomptarget/deviceRTLs/common/target_region.h
28

I'm not talking about this particular file, just like I said we can put it into common subdirectory.

OK. It is (the only file in the common folder for now).

I'm talking about the implementation files. They all are written in Cuda, no?

Yes, Cuda, and placed under the nvptx folder for that reason. That is what you want, correct?

But it is not proved yet that this solution is target agnostic. Did you test it for AMD?

What do you mean by solution? I do not have a second implementation of the interface but nothing up to the implementation of the interface is target aware. By construction, this means it will work for anything we can implement the interface in.

Why do you fight so hard against this? What exactly do you want to change here? Given the last comment, and assuming I understand you correctly, the files are all exactly where you want them to be. That the wording sometimes states "target agnostic" is a sign of intent, even if for some currently unknown reason it would not hold true.

101

Let's start this one again because I still haven't understood. Why do we need to populate the ident_loc again? What information has to be in there at which point? I want this to be clear because a lot of other "design decisions" of the existing code base are in my opinion not necessary and consequently missing here. That includes, for example, various global variables. If we have a description of the problem you try to solve with the ident_loc we might be able to find a way that cuts down on state.

Regarding the "compatibility", this is not a stable interface people can rely on. Whatever is committed in this first patch is not set in stone. Also, we can _always_ add a __kmpc_init_ident_loc(....) function after the fact.

openmp/libomptarget/deviceRTLs/nvptx/src/omp_data.cu
70 ↗(On Diff #190484)

I don't know what my buffer are talking about.

Sorry, my bad. The one you see in the (last part of the) implementation below in the beginning of the shown lines of openmp/libomptarget/deviceRTLs/nvptx/src/omptarget-nvptx.h. It is called omptarget_nvptx_SharedArgs and it does (a subset of) what this new buffer does, providing space for shared variables in parallel regions.

I'm just saying that we already using a lot of shared memory and adding another one shared memory buffer of ~150 bytes per team increases pressure on the shared memory. It would be good to reuse the existing buffers somehow. It was just a suggestion.

I understand and I agree. My comment explained why I didn't do that in the first place, hoping that you see the problem and agree we should rewrite the users of omptarget_nvptx_SharedArgs to use target_region_shared_buffer[1], thereby reducing the required shared memory.

[1] The name is subject to change! I don't care much.

openmp/libomptarget/deviceRTLs/nvptx/src/target_region.cu
65

I see its semantics, I'm asking when it must be set to true and when to false. Maybe I missed something, but currently, it is always set to true in the compiler patch. Do you really need it?

Yes, because the LLVM optimizer pass [1] will change the value.

What if you have a single kernel with several consecutive parallel regions? Can you handle this?

Yes. That works.

[1] https://reviews.llvm.org/D59331#C1385083NL563 (line 566-568 in OpenMPOpt.cpp)

167

Done.

ABataev added inline comments.Mar 14 2019, 2:34 PM
openmp/libomptarget/deviceRTLs/common/target_region.h
28

I'm trying to understand what is the best layout for your solution.

101

Ident_loc holds the data about current source code location, execution mode and is full runtime required or not. Also, it is used in OMPT/OMPD support.
Regarding "compatibility" libraries must be most stable part of the compiler, because the user migbt need to link the old object file/library with the new one. Because of this the new versions of libraries must be compatible with old ones. And you need to maintain the deprecated parts to keep the compatibility with the previous versions. All these libs already have a lot of old code that because of the initial poor design and we need to maintain them. I would like to avoid this situation with this patch.

openmp/libomptarget/deviceRTLs/nvptx/src/omp_data.cu
70 ↗(On Diff #190484)

This is not my buffer. Unfortunately, I did not work on this library since the very beginning. There are some other buffers, generated by the compiler, for example, and we can try to reuse them.

jdoerfert marked 3 inline comments as done.Mar 14 2019, 3:03 PM
jdoerfert added inline comments.
openmp/libomptarget/deviceRTLs/common/target_region.h
101

Ident_loc holds the data about current source code location, execution mode and is full runtime required or not. Also, it is used in OMPT/OMPD support.

We can store that information through a __kmpc_init_ident_loc(....) call once needed.

Regarding "compatibility" libraries must be most stable part of the compiler, because the user migbt need to link the old object file/library with the new one. Because of this the new versions of libraries must be compatible with old ones. And you need to maintain the deprecated parts to keep the compatibility with the previous versions. All these libs already have a lot of old code that because of the initial poor design and we need to maintain them. I would like to avoid this situation with this patch.

The way I understand you now is that you want a way to extend the interface in the future and adding a changeable ident_loc pointer is your proposed way. Do I understand your reaonsing for ident_loc here correctly or is it (this and) something else?

openmp/libomptarget/deviceRTLs/nvptx/src/omp_data.cu
70 ↗(On Diff #190484)

This is not my buffer.

My "you" was not directed at you but a general one. The wording was bad, my apologies.

There are some other buffers, generated by the compiler, for example, and we can try to reuse them.

I'm not 100% sure which buffers you refer to here but I think that are the ones the new code generation does not emit anymore.

I'm all for merging/replacing multiple buffers implemented in the device RTL, I didn't do it because it breaks compatibility or it forces me to inherit design choices I dislike (the void** buffer). From my perspective we could get rid of the existing omptarget_nvptx_SharedArgs space by letting it use the target_region_shared_buffer internally. That solves the problem for now and once omptarget_nvptx_SharedArgs isn't directly needed anymore it is removed.

jdoerfert updated this revision to Diff 190767.Mar 14 2019, 6:16 PM
jdoerfert marked 4 inline comments as done.

Add ident_t* to the interface functions

openmp/libomptarget/deviceRTLs/common/target_region.h
101

Addent ident_t everywhere.

jdoerfert marked an inline comment as done.Mar 14 2019, 10:02 PM
ABataev added inline comments.Mar 15 2019, 10:39 AM
openmp/libomptarget/deviceRTLs/common/target_region.h
105

If you're using ident_t UseSPMDMode and RequiresOMPRuntime parameters are not needed anymore. They are passed in ident_t structure.

openmp/libomptarget/deviceRTLs/nvptx/src/omp_data.cu
70 ↗(On Diff #190767)

What is this buffer used for? Transferring pointers to the shread variables to the parallel regions? If so, it must be handled by the compiler. There are several reasons to do this:

  1. You're using malloc/free functions for large buffers. The fact is that the size of this buffer is known at the compile time and compiler can generate the fixed size buffer in the global memory if required. We already have similar implementation for target regions, globalized variables etc. You can take a look and adapt it for your purpose.
  2. Malloc/free are not very fast on the GPU, so it will get an additional performance with the preallocated buffers.
  3. Another one problem with malloc/free is that they are using preallocated memory and the size of this memory is limited by 8Mb (if I do recall correctly). This memory is required for the correct support of the local variables globalization and we alredy ran into the situation when malloc could not allocate enough memory for it with some previous implementations.
  4. You can reused the shared memory buffers already generated by the compiler and save shared memory.
jdoerfert marked an inline comment as done.

Rebase onto D59424 and fix errors caused by the wrong use of ident_t

jdoerfert marked an inline comment as done.Mar 15 2019, 11:46 AM

What is this buffer used for? [...]

I'll copy your comment and respond in this review D59424.

openmp/libomptarget/deviceRTLs/common/target_region.h
105

If you're using ident_t UseSPMDMode and RequiresOMPRuntime parameters are not needed anymore. They are passed in ident_t structure.

They are not in the TRegion interface, at least not by the TRegion code generation. If required, we can add that or require the __kmpc_target_region_kernel_init implementation to store the values in the ident_t. Regardless, we do not want to hide the variables in the ident_t because that would result in worse analysis results and cause optimizations to be harder. The main point of all these changes is, after all, to make optimizations easy. Given that we expect these functions to be inlined, there is also no harm done wrt. runtime costs.

ABataev added inline comments.Mar 15 2019, 11:53 AM
openmp/libomptarget/deviceRTLs/common/target_region.h
105

This is why we used them. Those ident_ts are constant and it allows us to perform an additional optimization in the functions, that do not have isSPMDMpde and RequiresFullRuntime. Because of this parameter, we gained a significant performance boost. LLVM knows how to deal with the structures, don't worry about the optimization.

Fix a typo (use of wrong variable) and improve comments

jdoerfert marked an inline comment as done.Mar 15 2019, 12:20 PM
jdoerfert added inline comments.
openmp/libomptarget/deviceRTLs/common/target_region.h
105

This is why we used them. Those ident_ts are constant and it allows us to perform an additional optimization in the functions, that do not have isSPMDMpde and RequiresFullRuntime.

The boolean parameters are (currently) also constant. The main point however is that in our expected use case, an inlined device RTL, there is literally no cost to pay by having the flags explicit as parameters.

Because of this parameter, we gained a significant performance boost.

Compared to what? Not having information about the execution mode, etc. at all? How would that become worse?

LLVM knows how to deal with the structures, don't worry about the optimization.

I am (painfully) aware of LLVM's capability to promote arguments (that is what is needed if we do not inline or perform IP-SCCP). However, using a pointer does allow the use of non-constant ident_t values, which are problematic. They might actually be useful for the original purpose of ident_t, namely location information. Think function merging that will cause a call with one of multiple different ident_t pointers. Making sure we can promote the values in that case is already much harder than checking if all potential values are the same boolean constant.

ABataev added inline comments.Mar 15 2019, 12:27 PM
openmp/libomptarget/deviceRTLs/common/target_region.h
105
  1. This is the data duplication.
  2. Compared to the previous implementation.
  3. It allows, yes, but the compiler generates constant ident_t. This structure used not only for the location information, but it used also for other purposes. There are no problems with the code inlining and optimization for ident_ts.
jdoerfert marked an inline comment as done.Mar 15 2019, 12:46 PM
jdoerfert added inline comments.
openmp/libomptarget/deviceRTLs/common/target_region.h
105
  1. This is the data duplication.

What is? Having explicit constant boolean parameters? There is no "duplication" if they are constant and the functions are inlined. If you really think otherwise, I'm afraid we will not make progress here without a third opinion.

  1. Compared to the previous implementation.

I do not know what the previous implementation was. I'm also unsure what the point is you are trying to make. If it is different from point 1., could you please elaborate?

  1. It allows, yes, but the compiler generates constant ident_t. This structure used not only for the location information, but it used also for other purposes. There are no problems with the code inlining and optimization for ident_ts.

For now, maybe. I just gave you a very plausible example of how there could be performance implications in the near future due to the indirection compared to explicit boolean parameters.

Introduce a ternary mode for parallel regions, fix minor mistakes

Could you add some tests for this?

Could you check what the difference is between the same kernel in today's SPMD mode vs the SPMD mode produced via this method? Number of registers, instructions, checking everything gets optimized out as expected. The LLVM-IR should be almost identical.