This is an archive of the discontinued LLVM Phabricator instance.

[OpenMP][NFCI] Cleanup the target state queue implementation
Needs ReviewPublic

Authored by jdoerfert on Jul 4 2019, 12:42 PM.

Details

Reviewers
openmp-commits
Summary

Note: WIP patch 1/3 to go with an RFC for the device RTL design.

This NFCI patch includes the following cleanup steps:

  • Merge state-queue.h and state-queuei.h, as the separation was artificial and the files had a cyclic include dependence anyway.
  • Remove the now obsolete state-queuei.h.
  • Adjust the code according to the LLVM coding style, especially with respect to variable and method names.
  • Change the names of types to be generic, or just less NVPTX specific.
  • Wrap CUDA-specific calls into __kmpc_impl_XXX functions and define them in a dedicated target_impl.h file (see the sketch below).
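
As a rough illustration of the last step, a target_impl.h wrapper could look like the sketch below. This is a minimal sketch under assumptions, not the patch itself: only the __kmpc_impl_XXX naming and the idea of hiding CUDA builtins behind such functions come from the summary; the exact signatures and the INLINE definition are invented for illustration.

  // Hypothetical nvptx sketch of target_impl.h; signatures are assumed.
  #include <stdint.h>

  #define INLINE __forceinline__ __device__

  INLINE uint32_t __kmpc_impl_atomic_add(uint32_t *Address, uint32_t Val) {
    return atomicAdd(Address, Val); // CUDA builtin on nvptx
  }

  INLINE uint32_t __kmpc_impl_atomic_exchange(uint32_t *Address, uint32_t Val) {
    return atomicExch(Address, Val); // CUDA builtin on nvptx
  }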

Event Timeline

jdoerfert created this revision.Jul 4 2019, 12:42 PM
Herald added a project: Restricted Project.Jul 4 2019, 12:42 PM
Herald added subscribers: jfb, bollu.
ABataev added inline comments.Jul 4 2019, 3:41 PM
openmp/libomptarget/deviceRTLs/nvptx/src/state-queue.h
1

I think it is better to mark it as CUDA, not C/C++, since it contains some CUDA-specific constructs.

26

Maybe it is better to name it in LLVM style, without underscores etc.?

openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.h
19

Why do we need these functions? IMHO, they do not add anything useful, just increase code complexity.

jdoerfert marked 5 inline comments as done.Jul 5 2019, 10:30 AM
jdoerfert added inline comments.
openmp/libomptarget/deviceRTLs/nvptx/src/state-queue.h
1

I thought I removed all of them; if I haven't, we should. The idea is that this "core logic" code should _not_ be CUDA code but something else, e.g., C++ seems natural.

26

I like the idea for internal types. It will make it more consistent and the external interface also easier to spot.

openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.h
19

They add a level of abstraction around the CUDA implementation, here the CUDA functions atomicAdd and atomicExch. In the AMD target_impl.h the __kmpc_impl_atomic_exchange would maybe be implemented with an explicit compare-exchange loop (I'm guessing here).
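
To make that guess concrete, such a fallback could look roughly like the sketch below (purely illustrative, as hedged above; written in CUDA syntax for uniformity and reusing the assumed INLINE macro from the earlier sketch):

  // Hypothetical __kmpc_impl_atomic_exchange built from a compare-exchange
  // loop instead of a native exchange instruction.
  INLINE uint32_t __kmpc_impl_atomic_exchange(uint32_t *Address, uint32_t Val) {
    uint32_t Old = *Address;
    uint32_t Assumed;
    do {
      Assumed = Old;
      Old = atomicCAS(Address, Assumed, Val); // retry until the swap sticks
    } while (Old != Assumed);
    return Old; // previous value, as with a native exchange
  }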

ABataev added inline comments.Jul 5 2019, 12:35 PM
openmp/libomptarget/deviceRTLs/nvptx/src/state-queue.h
1

Nope, the INLINE macro expands into CUDA-specific attributes.

jdoerfert marked 4 inline comments as done.Jul 5 2019, 1:37 PM
jdoerfert added inline comments.
openmp/libomptarget/deviceRTLs/nvptx/src/state-queue.h
1

It is a macro, and macros are not specific to CUDA. What it expands to when this file is included from CUDA code right now is not relevant. This is a first step; we will have to move these files into the common folder (see the RFC). The macro definitions will live in the device-specific parts, which can then use whatever device (language) specific extensions they want without making this part specific to that device (language) extension. That is, after all, the main motivation behind these patches.

ABataev added inline comments.Jul 5 2019, 4:59 PM
openmp/libomptarget/deviceRTLs/nvptx/src/state-queue.h
1

But anyway, you need to compile it in CUDA or HIP mode, not pure C++. I think it is better for the developer to understand from the very beginning that this code requires a CUDA or HIP compiler.

jdoerfert marked 2 inline comments as done.Jul 5 2019, 5:16 PM
jdoerfert added inline comments.
openmp/libomptarget/deviceRTLs/nvptx/src/state-queue.h
1

Given that this is "a header file", there is really no telling if "it is" C/C++/Cuda/Hip/OpenCL/Sycl/...

All it should require is Clang (potentially in some mode). Given the new OpenMP standard, the mode could reasonably be "-fopenmp -fopenmp-targets=...".

ABataev added inline comments.Jul 5 2019, 5:35 PM
openmp/libomptarget/deviceRTLs/nvptx/src/state-queue.h
1

Not necessarily. Currently it can be compiled with nvcc too. It does not produce a bc file, but it can be linked and it works. So Clang is not necessarily a requirement here.
What I'm saying is that it is better to inform the developer that this code requires a CUDA or HIP compiler, not a pure C++ compiler.

29

I would like to keep the volatile modifier at least for CUDA 8. Without volatile it may produce incorrect code. I would keep it until we drop support for CUDA 8.
Plus, I would suggest testing this very seriously for other versions of CUDA. Does it really work correctly? Ptxas may apply some incorrect optimizations without volatile. Though I like the idea of removing them.

jdoerfert marked 2 inline comments as done.Jul 5 2019, 6:05 PM
jdoerfert added inline comments.
openmp/libomptarget/deviceRTLs/nvptx/src/state-queue.h
1

It does not require cuda or hip.

29

These members are accessed only through *atomic* accesses. Why would we require volatile in addition?

hfinkel added inline comments.Jul 5 2019, 6:19 PM
openmp/libomptarget/deviceRTLs/nvptx/src/state-queue.h
1

I think that it's useful to think about this in the following way: We often write code that requires, at least given some configuration of the preprocessor, various extensions to standard C++. CUDA and HIP are C++ extensions. Code using __builtin_frame_address or __attribute__((always_inline)) or vector intrinsics is code using C++ extensions. If the code can only be compiled with CUDA extensions enabled, then by all means, using .cu is appropriate. If it can be compiled without use of C++ extensions, or with a wide variety of C++ extensions, given different configurations of the preprocessor, then just calling it a "C++" source file seems reasonable.

ABataev added inline comments.Jul 5 2019, 6:33 PM
openmp/libomptarget/deviceRTLs/nvptx/src/state-queue.h
1

Hal, I agree, but this code cannot be compiled correctly without CUDA. You need to generate the device code for these functions, and because of this they must be marked as __device__ functions. This is a CUDA-specific modifier that tells the compiler that we don't need the host version of this code, just the device part. But if you think that this is not important, OK, mark it as C++.

ABataev added inline comments.Jul 5 2019, 6:35 PM
openmp/libomptarget/deviceRTLs/nvptx/src/state-queue.h
29

Without volatile they were not initialized properly on CUDA 8. This, again, seems to me like some kind of bug in ptxas for CUDA 8. Not sure about this problem in CUDA 9; it requires some additional testing.

hfinkel added inline comments.Jul 5 2019, 6:41 PM
openmp/libomptarget/deviceRTLs/nvptx/src/state-queue.h
1

I see "*if* compiling as CUDA, we need the functions marked with __device__" as very similar to saying "*if* we compile into a Windows DLL we need the functions marked with __declspec(dllexport)". It's fine to mark it as C++ because, aside from the ABI modifiers, it's essentially "just" C++ code.

jdoerfert marked 2 inline comments as done.Jul 5 2019, 7:10 PM
jdoerfert added inline comments.
openmp/libomptarget/deviceRTLs/nvptx/src/state-queue.h
1

I mentioned this already but I will say it again: there is also no other CUDA code in this file. In particular, there is *no* occurrence of __device__ in this file. Depending on the context it is put into, INLINE could resolve to something that includes __device__, but it could also resolve to something without __device__.

29

I actually did not notice that there is apparently no initialization of these values (or did I miss it?). If there is none, then that is the problem which needs fixing, e.g., uint32_t Head = 0.
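
A minimal sketch of the proposed fix, with hypothetical class and member names (whether the CUDA toolchain accepts in-class initializers for a __device__ global of such a type is debated further down):

  #include <stdint.h>

  template <typename ElementType, uint32_t SIZE> class StateQueue {
    // Give every member a well-defined initial value instead of relying on
    // implicit zero-initialization of device globals.
    uint32_t Head = 0;
    uint32_t Tail = 0;
    uint32_t Ids[SIZE] = {0};
    ElementType Elements[SIZE] = {};
  };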

ABataev added inline comments.Jul 5 2019, 7:26 PM
openmp/libomptarget/deviceRTLs/nvptx/src/state-queue.h
29

They are initialized, but without the volatile qualifier they are not initialized in the compiled code.

jdoerfert marked an inline comment as done.Jul 5 2019, 7:34 PM
jdoerfert added inline comments.
openmp/libomptarget/deviceRTLs/nvptx/src/state-queue.h
29

I checked and I could not find the initialization of these variables anywhere. We should make sure the problem is not the UB of not initializing them before we require volatile here.

ABataev added inline comments.Jul 6 2019, 5:37 AM
openmp/libomptarget/deviceRTLs/nvptx/src/state-queue.h
29

I don't think there is any UB, we checked it some time ago. The memory is zero-initialized, but without the volatile modifier it works incorrectly, at least for CUDA 8.
I'm trying to reduce the use of this structure with my patches because, most probably, it relies on some undocumented features.
Plus, I don't think it is efficient. It implements a queue to manage the resources, but I think we don't need a queue here, just a hash table.

jdoerfert marked an inline comment as done.Jul 6 2019, 10:41 AM
jdoerfert added inline comments.
openmp/libomptarget/deviceRTLs/nvptx/src/state-queue.h
29

I don't think there is any UB, we checked it some time ago. The memory is zero-initialized, but without the volatile modifier it works incorrectly, at least for CUDA 8.

I'm confused. I was expecting CUDA to have C++ semantics which requires class members to be initialized explicitly, or am I missing something here?

Plus, I don't think it is efficient. It implements a queue to manage the resources, but I think we don't need a queue here, just a hash table.

That is a question for a follow-up patch. I did not try to change the semantics in these patches, just the coding style and the organization of the code.

ABataev added inline comments.Jul 6 2019, 11:46 AM
openmp/libomptarget/deviceRTLs/nvptx/src/state-queue.h
29

I'm confused. I was expecting CUDA to have C++ semantics which requires class members to be initialized explicitly, or am I missing something here?

Read this https://docs.nvidia.com/cuda/archive/10.1/cuda-c-programming-guide/index.html#device-memory-specifiers

jdoerfert marked an inline comment as done.Jul 9 2019, 2:02 PM
jdoerfert added inline comments.
openmp/libomptarget/deviceRTLs/nvptx/src/state-queue.h
29

I fail to find the part that says they are initialized to zero. Could you specify where I can find that?

ABataev added inline comments.Jul 9 2019, 2:08 PM
openmp/libomptarget/deviceRTLs/nvptx/src/state-queue.h
29

There is no such part. It is an implementation feature this class relies on, as I understand it. I'm trying to reduce the number of such things in the library.

jdoerfert marked an inline comment as done.Jul 9 2019, 3:33 PM
jdoerfert added inline comments.
openmp/libomptarget/deviceRTLs/nvptx/src/state-queue.h
29

I'm confused. I still stand by my earlier statement: this is currently UB and volatile somehow "makes it work". Agreed?

Couldn't we simply introduce the initialization and get rid of the volatile as all accesses (that are actually there) happen atomically anyway?

ABataev added inline comments.Jul 9 2019, 4:23 PM
openmp/libomptarget/deviceRTLs/nvptx/src/state-queue.h
29

Not quite. The value is zero-initialized by default. This is not documented; it is implementation-specific. I agree that this is better to fix.

Atomic operations are not the best choice. They actually serialize the execution, and that leads to significant performance degradation. The volatile modifier allows avoiding that.

I'm not sure that this will fix the problem with the incorrect optimizations in CUDA 8, at least. It seems to me this version has some internal incorrect optimizations, which can be prevented with the volatile modifier. I would suggest making it CUDA 8 specific, at least. When we drop support for CUDA 8, we can remove this completely. I'm not sure about this problem in CUDA 9; it must be checked very seriously, but as far as I know CUDA 9 is much more stable than CUDA 8.

jdoerfert marked an inline comment as done.Jul 9 2019, 11:48 PM
jdoerfert added inline comments.
openmp/libomptarget/deviceRTLs/nvptx/src/state-queue.h
29

Not quite. The value is zero-initialized by default. This is not documented; it is implementation-specific.

How does one learn of this undocumented feature, how can one test for it, and how is one safe from changes to "the implementation"?

Atomic operations are not the best choice. They actually serialize the execution

  1. That is not true as a general statement, e.g., with respect to non-queue-related operations, and
  2. I'm pretty sure we want sequential consistency when it comes to modifications of the internal state of this queue.

and it leads to significant performance degradation. Volatile modifier allows avoiding it.

What performance degradation? Do you have performance (and correctness) results for a state-queue without atomic accesses?

I'm not sure that this will fix the problem with the incorrect optimizations in CUDA 8, at least. It seems to me this version has some internal incorrect optimizations, which can be prevented with the volatile modifier. I would suggest making it CUDA 8 specific, at least. When we drop support for CUDA 8, we can remove this completely. I'm not sure about this problem in CUDA 9; it must be checked very seriously, but as far as I know CUDA 9 is much more stable than CUDA 8.

Did you test the state-queue implementation without volatile but *with* explicit initialization of the members? I would like to verify a potential problem myself. If you have a test case and information about the configuration (Cuda version, clang version, hardware, ...) I will do my best to verify your above reasoning.

ABataev added inline comments.Jul 10 2019, 6:58 AM
openmp/libomptarget/deviceRTLs/nvptx/src/state-queue.h
29

How does one learn of this undocumented feature, how can one test for it, and how is one safe from changes to "the implementation"?

Just like I said, I also don't like it. But the atomic alternative is much worse.

  1. That is not true as a general statement, e.g., with respect to non-queue-related operations, and

It is true if several threads are trying to access the same memory location. This is exactly our situation.

2. I'm pretty sure we want sequential consistency when it comes to modifications of the internal state of this queue.

And we do it atomically. And it leads to significant performance degradation. Why do you think the full runtime mode/general mode is much slower than the SPMD mode without runtime? Mostly because of the atomic operations and lockless structures.

What performance degradation? Do you have performance (and correctness) results for a state-queue without atomic accesses?

Not exactly with the queue itself (though we have an atomic loop in the queue and it makes the performance significantly worse). We had the same situation with teams reductions. We had to change the scheme because the performance was very poor (again, atomics). You can try to search Bugzilla; there must be an error report about poor performance of teams reductions and a fix that changes the scheme to reduce the number of atomic operations.

Our problem here is that CUDA does not have block/teams synchronization primitives. If we had effective block/teams sync primitives we could solve this problem easily. Without them, we can only use atomic operations for synchronization, in a loop. And we end up with poor performance immediately.
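
For illustration only, the "atomic operations for synchronization, in a loop" pattern referred to here amounts to something like the following sketch (not code from the runtime; naive spinning can even deadlock within a warp on pre-Volta hardware, which underlines the point about cost):

  __device__ unsigned LockFlag = 0; // hypothetical global flag

  __device__ void acquireLock() {
    // Every contending thread loops on an atomic until it wins; the hardware
    // serializes the atomics on the shared location.
    while (atomicCAS(&LockFlag, 0u, 1u) != 0u)
      ;
  }

  __device__ void releaseLock() { atomicExch(&LockFlag, 0u); }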

Did you test the state-queue implementation without volatile but *with* explicit initialization of the members? I would like to verify a potential problem myself. If you have a test case and information about the configuration (Cuda version, clang version, hardware, ...) I will do my best to verify your above reasoning.

I tried it some time ago. After some testing I decided that it would be better to try to reduce the use of this queue instead if we want to get the performance.

You can try to implement direct initialization. You can try any test that requires full runtime mode (SPMD or non-SPMD constructs, it does not matter) since you just want to change the initialization.

jdoerfert marked an inline comment as done.Jul 10 2019, 4:13 PM
jdoerfert added inline comments.
openmp/libomptarget/deviceRTLs/nvptx/src/state-queue.h
29

Let us stop having these very long, very abstract conversations in code reviews that seem to lead nowhere due to the lack of actual facts. Instead, let me ask a couple of direct questions so that the comments made before become clearer and we can actually make some progress.


Do you know that the memory of class members in CUDA is zero initialized or not? If so, how does one know and test this?

Did you (recently) run any tests, correctness or performance, related to

  • this queue?
  • the use of atomic here?
  • the use of volatile here?
  • the missing initialization here?

How did you measure the performance gain/loss due to atomics in non-SPMD mode given that non-SPMD mode is vastly different from SPMD mode? Or, put differently, how does one know the performance difference between the modes stems from the use of atomics and not from anything else that is different, e.g., a software-managed stack, the indirection through a state machine, increased register usage, ...?

ABataev added inline comments.Jul 10 2019, 5:35 PM
openmp/libomptarget/deviceRTLs/nvptx/src/state-queue.h
29

Let us stop having these very long, very abstract conversations in code reviews that seem to lead nowhere due to the lack of actual facts. Instead, let me ask a couple of direct questions so that the comments made before become clearer and we can actually make some progress.

Agree.

Do you know that the memory of class members in CUDA is zero initialized or not? If so, how does one know and test this?

It initializes the global data with zero. CUDA is LLVM-based and relies on the LLVM IR restrictions, I assume. These restrictions make it zero-initialize all global data.
It was not me who decided to rely on this implementation-defined behavior; I'm personally not a big fan of it.

Did you (recently) run any tests, correctness or performance, related to

this queue?

Almost every time before sending a patch for review or committing it.

the use of atomic here?

Not exactly with this structure, but with similar ones, like the teams reduction buffers, the memory management hash table, etc.

the use of volatile here?

Yes, at least for CUDA 8 it is required to have volatile here. Not sure about CUDA 9+.

the missing initialization here?

Again, almost every time I submit a patch.

How did you measure the performance gain/loss due to atomics in non-SPMD mode given that non-SPMD mode is vastly different from SPMD mode? Or, put differently, how does one know the performance difference between the modes stems from the use of atomics and not from anything else that is different, e.g., a software-managed stack, the indirection through a state machine, increased register usage, ...?

I have statistics for the test runs with the atomics and without. These statistics show everything. We have different tests, ones with high register usage and ones with low register usage. I personally use nvprof and time, and many runs, to make the statistics more stable.

ABataev added inline comments.Jul 10 2019, 6:05 PM
openmp/libomptarget/deviceRTLs/nvptx/src/state-queue.h
29
jdoerfert marked 2 inline comments as done.Jul 10 2019, 7:06 PM
jdoerfert added inline comments.
openmp/libomptarget/deviceRTLs/nvptx/src/state-queue.h
29

It initializes the global data with zero.

I fail to reproduce this in a simple test locally. I have a __device__ global variable of struct type and the LLVM-IR declaration I get for it has an undef initializer, not a zeroinitializer. How do/did you verify that it is initialized?
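
For reference, the experiment described here reduces to something like the following sketch (reconstructed, not the actual test):

  // test.cu -- inspect the emitted IR, e.g. with:
  //   clang++ -x cuda --cuda-device-only -S -emit-llvm test.cu
  struct State {
    unsigned Head;
    unsigned Tail;
  };
  __device__ State GlobalState; // does the IR global get 'undef' or
                                // 'zeroinitializer'?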

the missing initialization here?

Again, almost every time I submit a patch.

I'm talking about the queue implementation and all the things mentioned about it.

I have statistics for the test runs with the atomics and without.

So the same code, except run once with atomics and once without? Is it still correct? If so, why do we have the atomics, e.g., in this queue?

These statistics show everything.

Can you share them?

29

Just in case, take a look at this http://supercomputingblog.com/cuda/cuda-tutorial-4-atomic-operations/ and this http://supercomputingblog.com/cuda/cuda-tutorial-5-performance-of-atomics/

This seems to be a micro-benchmark evaluation that stress-tests atomic operations. It was published ~10 years ago. I really doubt such comments help this discussion.

ABataev added inline comments.Jul 11 2019, 7:18 AM
openmp/libomptarget/deviceRTLs/nvptx/src/state-queue.h
29

This seems to be a micro-benchmark evaluation that stress-tests atomic operations. It was published ~10 years ago. I really doubt such comments help this discussion.

There have not been many changes since then; the information is still current.

29

I fail to reproduce this in a simple test locally. I have a __device__ global variable of struct type and the LLVM-IR declaration I get for it has an undef initializer, not a zeroinitializer. How do/did you verify that it is initialized?

It is handled by the PTX-to-SASS compiler. See this https://docs.nvidia.com/cuda/archive/8.0/parallel-thread-execution/index.html#initializers. It states that "Variables in constant and global state spaces with no explicit initializer are initialized to zero by default."

I'm talking about the queue implementation and all the things mentioned about it.

Yes, it is an inherent part of the tests, since many of them rely on the full runtime and just won't work correctly without a correct runtime.

So same code, except once run with atomics once without? Is it still correct? If so, why do we have the atomics, e.g., in this queue?

We're talking about initialization. Other operations in the queue still require atomic operations to implement a lockless queue.

Can you share them?

I can provide you my statistics but I don't think they are going to be very helpful. I started them about half a year ago (when I started to work on code improvements to gain more performance in the compiler and in the runtime), they are specific to the server I use for this work, and the numbers are for CUDA 8 only. Do you really need them?

Ideally, we can try to get rid of the target state queue by reducing the amount of information and using only shared memory. Without doubt, this would improve performance as Alexey mentioned and lighten the global memory usage (hundreds of MB to GB). Based on my experiments last year, I think that's doable iff we don't support nested parallelism (see thread on openmp-dev) and possibly "tweak" the spec such that we don't need to track some ICVs once the execution reaches an active parallel region (e.g., we don't need to care about nthreads-var if all nested regions are serialized, but currently the user might query the set value via omp_get_max_threads).

Ideally, we can try to get rid of the target state queue by reducing the amount of information and using only shared memory. [...]

I would try to use lazy initialization for the cases where we need to track some ICVs. I did not try it yet but thought about it for some time.

It is handled by the PTX-to-SASS compiler.

If they are uninitialized in LLVM-IR, it doesn't matter what happens at the PTX level and lower, LLVM could simply get rid of the accesses.
Why do you fight so hard against zero-cost initialization that *might be* needed for correctness?

Do you really need them?

Do I need/want the code and results of a correct and faster implementation? Yes, please share the results and the code so we can improve things. I don't know why you think I would not want improvements.

Ideally, we can try to get rid of the target state queue by reducing the amount of information and using only shared memory. [...]

I would try to use lazy initialization for the cases where we need to track some ICVs. I did not try it yet but thought about it for some time.

These are great ideas that you should pursue. However, this patch does not actually change anything (important) in the queue implementation, and I think it would be best to open a ticket or an email thread on future development instead.


This discussion still seems to go nowhere. Let me summarize:

  • @ABataev has performance results for a correct and improved version of this queue. I'm looking forward to the patches!
  • @Hahnfeld and @ABataev discussed options to eliminate the need for a state queue altogether. I suggest we start a separate discussion on this as it is obviously more involved.
  • I will add initialization to the uninitialized class members.
  • I am not aware of any test case / system configuration under which this patch causes problems. If it does, we split it in two, one to do the refactoring, one to remove volatile.
  • @ABataev mentioned removing volatile could cause problems. I'm hoping for details on the system (Cuda, LLVM, Clang, version numbers) and maybe a reproducer for which it does.

If they are uninitialized in LLVM-IR, it doesn't matter what happens at the PTX level and lower, LLVM could simply get rid of the accesses.

They are initialized in LLVM IR, by zeroinitializer; they are just not initialized in the PTX file explicitly. But that is OK for PTX since, according to the documents, the memory is implicitly initialized to zero.

Why do you fight so hard against zero-cost initialization that *might be* needed for correctness?

I'm not fighting against zero-cost initialization. It is just not possible to implement zero-cost initialization.

Do I need/want the code and results of a correct and faster implementation? Yes, please share the results and the code so we can improve things. I don't know why you think I would not want improvements?

The code is not implemented in full yet. I'm doing it step by step. As soon as I commit the patches with the threadid fixes etc., I will continue working on code and performance improvements.
I don't have a full implementation at the moment as I'm busy with other things, like bug fixing.
The results are just the times of our internal test suite.
Here they are:

real 315.65
user 63.56
sys 104.82

real 310.98
user 62.94
sys 105.07

real 310.91
user 54.78
sys 104.05

real 301.62
user 59.57
sys 105.11

These are the execution results with my previous patches for the NVPTX runtime. The last patch I sent for review gives about 299 seconds real time.

I am not aware of any test case / system configuration under which this patch causes problems. If it does, we split it in two, one to do the refactoring, one to remove volatile.

I'm telling you that the volatile stuff is required at least for CUDA 8. Make it conditional, so the volatile modifier can be applied for CUDA 8 at least. Any test with the full runtime reveals the problem.
If we're going to drop support for CUDA 8, it is a completely different story. In this case, I think, we can try to remove all the volatile functionality since, it seems to me, the problems were fixed in CUDA 9+. But I'm not sure about this; we need to check it very carefully. I don't want to end up with a non-functional runtime that works only from time to time, on some particular systems, and only with a limited set of code.

It is just not possible to implement zero-cost initialization.

There are only two possibilities:

  1. It is zero initialized, which you claim but which I failed to validate so far.
  2. It is not zero initialized.

In case 1, initializing with zero is free as it is already happening. We cannot have two zero initializers after all.
In case 2, initializing with zero is *necessary* as it can otherwise be transformed by LLVM in a way that would break the behavior we need.

It is zero initialized, which you claim but which I failed to validate so far.

To be clear, do you mean that you've not yet checked, or that you tried but did not observe the zeroinitializer?

It is zero initialized, which you claim but which I failed to validate so far.

Short answer: it is zero initialized.
Long answer: in LLVM IR form it is emitted with a zeroinitializer. When this LLVM IR is compiled to PTX, the zeroinitializer is dropped. But that is OK since, according to the PTX format, global variables are zero-initialized by default if they don't have initializers.

Short answer: it is zero initialized. [...] according to the PTX format, global variables are zero-initialized by default if they don't have initializers.

So why exactly do you object to adding explicit zero initialization when it would not change a thing? You know, just in case the behavior changes. To quote you:

The value is zero-initialized by default. This is not documented; it is implementation-specific. I agree that this is better to fix.

So why exactly do you object to adding explicit zero initialization when it would not change a thing?

It is not documented in CUDA itself, but it is documented in the PTX format specification.

Plus, how are you going to initialize the fields? Provide a full constructor? That is prohibited for globals in CUDA. All other solutions require the use of atomics and cause performance degradation.

openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.h
14

This is interesting. I've just been looking at moving the functions that use inline asm into a target-specific region to help code reuse for non-nvptx architectures.

JonChesterfield added inline comments.Aug 6 2019, 9:11 AM
openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.h
19

As it happens, atomicExch works fine for amdgcn. I'm not sure all the atomics are supported.

Closely related, though: atomicMax doesn't appear to be available for CUDA_ARCH < 350, so on nvptx there's a function in loop.cu (__kmpc_reduce_conditional_lastprivate) which open-codes a max in terms of atomicCAS.

Given some variation in capability between CUDA revisions, a compatibility shim that implements at least atomicMax seems a good idea (see the sketch below). I'm not sure writing a compatibility layer for all of the possible atomics is within scope - such a thing may already exist elsewhere.
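
A sketch of such a shim for the 64-bit case follows; the helper name is hypothetical, but the atomicCAS loop is the same technique as the open-coded max in loop.cu mentioned above:

  __device__ unsigned long long
  __kmpc_impl_atomic_max(unsigned long long *Address, unsigned long long Val) {
    unsigned long long Old = *Address;
    while (Old < Val) {
      unsigned long long Assumed = Old;
      Old = atomicCAS(Address, Assumed, Val); // succeeds iff still == Assumed
      if (Old == Assumed)
        break;
    }
    return Old; // like atomicMax, returns the previous value
  }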

ABataev added inline comments.Aug 6 2019, 9:13 AM
openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.h
19

It works for AMD but may not work for some other platforms. Better to make it target-dependent anyway.

grokos added inline comments.Aug 6 2019, 9:44 AM
openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.h
19

Closely related, though: atomicMax doesn't appear to be available for CUDA_ARCH < 350, so on nvptx there's a function in loop.cu (__kmpc_reduce_conditional_lastprivate) which open-codes a max in terms of atomicCAS.

Just a correction: 64-bit atomicMax is not available for CUDA_ARCH < 350, but the 32-bit version is. There has been an attempt to emulate the 64-bit atomicMax here: https://reviews.llvm.org/D46185 (although that patch was wrong and the author hasn't responded in well over a year).

openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.h
19

Ah, nice. Thanks for the detail. Curiously, there doesn't appear to be a signed 64-bit atomicMax in CUDA (according to https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html).

Rocm's tree contains that (revised) patch to nvptx, presumably for whatever sm_30 device we have under test. I was under the impression that our nvptx tree was identical to upstream but actually there's some divergence for me to patch up.

@JonChesterfield @atmnpatel @tianshilei1992 Unclear if this is still needed and/or applies. Feel free to take a look.

target_impl.h is fairly extensive now. There's some debt remaining in how the atomics are handled but it's not causing much harm.

I suspect, but have not proven, that getting rid of volatile qualifiers causes problems for nvptx. Nvidia's atomic model is volatile + fences, which isn't brilliantly compatible with llvm's atomic model. I don't have complete faith in the ptx backend successfully translating atomic semantics into code that ptxas does the right thing with. I'm therefore nervous about changing away from volatile qualifying everything.

The state queue has some limitations. @ronl and @pdhaliwal have spent more time looking at it than I have - IIRC it reads out of bounds for stack frames above a certain size, without diagnostics. The array indexed by smid() doesn't load-balance as well for amdgcn as it does for nvptx.

My preference is to delete the state queue entirely. I think it is only used for nested parallelism, which is very slow on gpus whatever we do with it, but there's some semantic problem with just ignoring the nested pragmas. That probably means we can replace the linked stack frame allocated from this state_queue with a compiler transform.
