This is an archive of the discontinued LLVM Phabricator instance.

[OpenMP][NFCI] Cleanup the target synchronization implementation
Needs ReviewPublic

Authored by jdoerfert on Jul 4 2019, 12:46 PM.

Details

Reviewers
openmp-commits
Summary

Note: WIP patch 2/3 to go with a RFC for the device RTL design (see D64217)

This NFCI patch includes the following cleanup steps:
  - Adjust the code according to the LLVM coding style, especially wrt.
    variable and method names.
  - Document the code with doxygen comments.
  - Change the comments to be less NVPTX specific.
  - Wrap CUDA specific calls into __kmpc_impl_XXX functions and define
    them in an own target_impl.h file.
  - Use a templated barrier implementation to remove code duplication.
  - Use a (macro) generator to reduce code duplication.

Event Timeline

jdoerfert created this revision.Jul 4 2019, 12:46 PM
Herald added a project: Restricted Project. · View Herald TranscriptJul 4 2019, 12:46 PM
Herald added subscribers: bollu, mgorny. · View Herald Transcript
jdoerfert retitled this revision from [OpenMP][NFCI] Cleanup the target synchronization implementation Note: WIP patch 2/3 to go with a RFC for the device RTL design (see D64217) to [OpenMP][NFCI] Cleanup the target synchronization implementation.Jul 4 2019, 12:48 PM
jdoerfert edited the summary of this revision. (Show Details)
jdoerfert updated this revision to Diff 208076.Jul 4 2019, 12:54 PM
jdoerfert edited the summary of this revision. (Show Details)

Add missing return type

ABataev added inline comments.Jul 4 2019, 3:48 PM
openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.h
48

I don't think this is correct. IsSPMD flag should be passed as a function parameter. Sometimes, we cannot define the execution mode at the compile time and we could define it only at the execution time (foe example, if the parallel region is called in the orphaned function, marked as noinline or compiled without optimizations, etc.)

jdoerfert marked an inline comment as done.Jul 5 2019, 10:35 AM
jdoerfert added inline comments.
openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.h
48

It is "correct" and it "works" with the rest of the code base but we can change it regardless:

It works this way because we have explicit __kmpc_barrier_XXXXX functions for the SPMD and non-SPMD case. Through that level of abstraction we know the required barrier implemenetation at compile time. If we want to move avay from the different barrier types that have the mode baked into their name, we would need to make the template parameters arguments for sure.

Long story short, I do not have strong feelings about this and it should not matter after inlining and constant propagation.

ABataev added inline comments.Jul 17 2019, 2:50 PM
openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.h
48

Actually, we use __kmpc_barrier in many cases. Even in SPMD mode. __kmpc_barrier_xxxx variants are used in very rare cases. And your change may lead to incorrect results in case of orphaned directives because you hardcoded IsSpmd to false in __kmpc_barrier. The fact that it works for you just means that you have very limited test set.
Inlining and constant propagation is not an option here. What if the user compiled the code at O0, without optimizations? Jus to debug the code? We should produce different results at O0 and O3? Or explicitly marked the function as noinline?

jdoerfert marked an inline comment as done.Aug 5 2019, 3:28 PM
jdoerfert added inline comments.
openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.h
48

Actually, we use kmpc_barrier in many cases. Even in SPMD mode. kmpc_barrier_xxxx variants are used in very rare cases. And your change may lead to incorrect results in case of orphaned directives because you hardcoded IsSpmd to false in __kmpc_barrier. The fact that it works for you just means that you have very limited test set.

Please take a look at line 52. As before, a call to __kmpc_barrier will first check if we are in SPMD mode. (Even if the template argument is false that happens, if it is true it is not going to happen though). Thus, it is no different to the behavior we had.

Inlining and constant propagation is not an option here.

I do not understand what you are taking about. This does not, as nothing ever can, rely on inlining and constant propagation.

What if the user compiled the code at O0, without optimizations? [...]

They get the same semantics but slower.

ABataev added inline comments.Aug 5 2019, 3:51 PM
openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.h
48

Please take a look at line 52. As before, a call to __kmpc_barrier will first check if we are in SPMD mode. (Even if the template argument is false that happens, if it is true it is not going to happen though). Thus, it is no different to the behavior we had.

Then why do we need this template argument if it does nothing but just confuses people?

ABataev added inline comments.Aug 5 2019, 4:06 PM
openmp/libomptarget/deviceRTLs/nvptx/src/sync.cpp
18

We don't support cancellation in the GPU runtime currently, so I think it is better to set IsCancellable to false to make it clear that cancellation is not supported yet.

18

I think I got the idea for these params. But it is better to give them some other names, currently they might confuse users.

37

Not sure that this instantiation provides the same afunctionality as the original implementation. Originally, it did not check for ghe number of active threads, just synced all non-SPMD threads unconditionally.

56

Can we invent something else here, not the macros?

88

This is impossible to get rid of it for NVPTX runtime, at least. Better to make it NVPTX specific function.

openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.h
55

This is not correct. kmpc_barrier can be called even in SPMD mode. Generally speaking, the spmd suffix also must be generated dynamically.

JonChesterfield added inline comments.
openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.h
48

Inlining and constant propagation is not an option here. What if the user compiled the code at O0, without optimizations?

Could you expand on your concern here? The code looks like it does the same thing at O0 and O3 to me (no calls to _builtin_constant_p) so O0 just means slower. That seems OK.

@ABataev I am confused by the comments you make. Could you describe a situation in which we do not execute the same code before and after this patch? Also, the O0/03 inline & constant propagation comment is not clear to me, what is the issue there?

@ABataev I am confused by the comments you make. Could you describe a situation in which we do not execute the same code before and after this patch? Also, the O0/03 inline & constant propagation comment is not clear to me, what is the issue there?

I was confused by the names of the template arguments.

openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.h
48

That was the wrong assumption on the meaning of the template arguments, I got the idea already but it is better to rename the arguments somehow because currently, they are confusing.

@ABataev I am confused by the comments you make. Could you describe a situation in which we do not execute the same code before and after this patch? Also, the O0/03 inline & constant propagation comment is not clear to me, what is the issue there?

I was confused by the names of the template arguments.

I'm happy to rename them if we want to go ahead with this __kmpc_impl_barrier (potentially without the rest in here). Name suggestions are always welcome.

@ABataev I am confused by the comments you make. Could you describe a situation in which we do not execute the same code before and after this patch? Also, the O0/03 inline & constant propagation comment is not clear to me, what is the issue there?

I was confused by the names of the template arguments.

I'm happy to rename them if we want to go ahead with this __kmpc_impl_barrier (potentially without the rest in here). Name suggestions are always welcome.

It would be good to try to gather the common functionality into some kind of class, maybe? Common functionality could be encapsulated into some kind common functionality templated class, while target specific functions could be encapsulated into a target specific class, which could be used as a template argument for common functionality template instantiation.
Something like:

template <typename TargetT>
class CommonFunctionality : public TargetT {
  void barrier(....) {
     if (spmd-mode) {
       TargetT::spmd_barrier();
     } else if (num_threads > 1) {
       TargetT::non_spmd_barrier();
     } else {
       TargetT::flush(); 
     }
  }
};

What do you think about it?
For the template arguments, maybe just do not make it a templated function? Maybe it is better to make set of simple functions and call them in the complex ones?

@JonChesterfield @atmnpatel @tianshilei1992 Unclear if this is still needed and/or applies. Feel free to take a look.

This looks like a difficult rebase. Some parts are obsolete (__kmpc_impl_active_thread_mask). Lifting runtime parameters with associated branches to compile time via the template is interesting but I wouldn't have guessed it's where we're losing most performance. Constant propagation probably does the same job with the bitcode RTL.

I'd be inclined to abandon this patch and recreate it if desired

This looks like a difficult rebase. Some parts are obsolete (__kmpc_impl_active_thread_mask). Lifting runtime parameters with associated branches to compile time via the template is interesting but I wouldn't have guessed it's where we're losing most performance. Constant propagation probably does the same job with the bitcode RTL.

It was not about performance but code duplication. Back then, maybe now too, we have various copies of the barrier implementation which is, sub-optimal.

I'd be inclined to abandon this patch and recreate it if desired

Agreed. But I might not be able to do that any time soon.

openmp/libomptarget/deviceRTLs/nvptx/src/sync.cpp