This is an archive of the discontinued LLVM Phabricator instance.

Paths

Table of Contentst

-
openmp/libomptarget/deviceRTLs/nvptx/src/
-
libomptarget/
-
deviceRTLs/
-
nvptx/
-
src/
-
omp_data.cu
-
omptarget-nvptx.cu
18/36
state-queue.h
-
state-queuei.h
2/7
target_impl.h

Differential D64217

[OpenMP][NFCI] Cleanup the target state queue implementation
Needs ReviewPublic

Authored by jdoerfert on Jul 4 2019, 12:42 PM.

Download Raw Diff

Details

Reviewers

openmp-commits

Summary

Note: WIP patch 1/3 to go with a RFC for the device RTL design.

This NFCI patch includes the following cleanup steps:

Merge the state-queue.h and state-queuei.h as the separation was artificial, and the files had a cyclic include dependence anyway.
Remove the now obsolete state-queuei.h.
Adjust the code according to the LLVM coding style, especially wrt. variable and method names.
Change the names of types to be generic, or just less NVPTX specific.
Wrap CUDA specific calls into __kmpc_impl_XXX functions and define them in an own target_impl.h file.

Diff Detail

Repository

rG LLVM Github Monorepo

Build Status

Buildable 34390
Build 34389: arc lint + arc unit

Event Timeline

jdoerfert created this revision.Jul 4 2019, 12:42 PM

Herald added a project: Restricted Project. · View Herald TranscriptJul 4 2019, 12:42 PM

Herald added subscribers: jfb, bollu. · View Herald Transcript

Harbormaster completed remote builds in B34390: Diff 208070.Jul 4 2019, 12:45 PM

jdoerfert mentioned this in D64218: [OpenMP][NFCI] Cleanup the target synchronization implementation.Jul 4 2019, 12:48 PM

ABataev added inline comments.Jul 4 2019, 3:41 PM

openmp/libomptarget/deviceRTLs/nvptx/src/state-queue.h
1	I think it is better to mark it as Cuda, not C/C++, since it contains some Cuda specific constructs.
26	Maybe it is better to name it in LLVM style, without underscores etc.?
openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.h
19	Why do we need these functions? IMHO, they do not add anything useful, just increase code complexity.

jdoerfert marked 5 inline comments as done.Jul 5 2019, 10:30 AM

jdoerfert added inline comments.

openmp/libomptarget/deviceRTLs/nvptx/src/state-queue.h
1	I though I removed all of them, if I haven't we should. The idea is that this "core logic" code should _not_ be CUDA code but something else, e.g., C++ seems natural.
26	I like the idea for for internal types. It will make it more consistent and the external interface also easier to spot.
openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.h
19	They add a level of abstraction around the CUDA implementation, here the CUDA functions `atomicAdd` and `atomicExch`. In the AMD `target_impl.h` the `__kmpc_impl_atomic_exchange` would maybe be implemented with an explicit compare-exchange loop (I'm guessing here).

ABataev added inline comments.Jul 5 2019, 12:35 PM

openmp/libomptarget/deviceRTLs/nvptx/src/state-queue.h
1	Nope, `INLINE` macro is expanded into cuda specific attributes.

jdoerfert marked 4 inline comments as done.Jul 5 2019, 1:37 PM

jdoerfert added inline comments.

openmp/libomptarget/deviceRTLs/nvptx/src/state-queue.h
1	It is a macro and macros are not specific to CUDA. To what it expands if this file is included from the CUDA code right now is not relevant. This is a first step and we will have to move these files into the common folder (see the RFC). The macro definition will be in the device specific parts and they can then use whatever device (language) specific extensions they want without making these part specific to that device (language) extension. That is, after all, the main motivation behind these patches.

ABataev added inline comments.Jul 5 2019, 4:59 PM

openmp/libomptarget/deviceRTLs/nvptx/src/state-queue.h
1	But anyway, you need to compile it in cuda or hip mode, not pure c++. I think it is better for tbe developer to understand from the very beginning that this code requires cuda or hip compiler.

jdoerfert marked 2 inline comments as done.Jul 5 2019, 5:16 PM

jdoerfert added inline comments.

openmp/libomptarget/deviceRTLs/nvptx/src/state-queue.h
1	Given that this is "a header file", there is really no telling if "it is" C/C++/Cuda/Hip/OpenCL/Sycl/... All it should require is Clang (potentially in some mode). Given the new OpenMP standard, the mode could reasonably be "-fpoenmp -fopenmp-targets=...".

ABataev added inline comments.Jul 5 2019, 5:35 PM

openmp/libomptarget/deviceRTLs/nvptx/src/state-queue.h
1	Not necessarily.currently, it can be compiled with nvcc too. It does not produce bc file, but can be linked and works. So, clang is not necessarily a requirement here. What I'm saying that it is better to inform the developer that this code requires cuda or hip compiler, not pure c++ compiler.
29	I would like to keep volatile modifier at least for cuda 8. Without volatile it may produce incorrect code. I would keep it until we drop support for cuda 8. Plus, I would suggest to test it very seriously for other versions of cuda. Does it really works correctly? Ptxas may use some incorrect optimizations without volatile. Though I like the idea of removing them.

jdoerfert marked 2 inline comments as done.Jul 5 2019, 6:05 PM

jdoerfert added inline comments.

openmp/libomptarget/deviceRTLs/nvptx/src/state-queue.h
1	It does not require cuda or hip.
29	These members are accessed only through atomic accesses. Why do we would require volatile in addition?

hfinkel added inline comments.Jul 5 2019, 6:19 PM

openmp/libomptarget/deviceRTLs/nvptx/src/state-queue.h
1	I think that it's useful to think about this in the following way: We often write code that requires, at least given some configuration of the preprocessor, various extensions to standard C++. CUDA and HIP are C++ extensions. Code using __builtin_frame_address or `__attribute__((always_inline))` or vector intrinsics is code using C++ extensions. If the code can only be compiled with CUDA extensions enabled, then by all means, using .cu is appropriate. If it can be compiled without use of C++ extensions, or with a wide variety of C++ extensions, given different configurations of the preprocessor, then just calling it a "C++" source file seems reasonable.

ABataev added inline comments.Jul 5 2019, 6:33 PM

openmp/libomptarget/deviceRTLs/nvptx/src/state-queue.h
1	Hal, I agree, but this code cannot be compiled without cuda correctly. You need to generate the device code for these functions and because of this they must be marked as `__device__` functions. This is cuda specific modifier that tells the compiler that we don't need the host version of this code, just the device part. But if you think that this is not important, ok, mark it as c++.

ABataev added inline comments.Jul 5 2019, 6:35 PM

openmp/libomptarget/deviceRTLs/nvptx/src/state-queue.h
29	Without `volatile` they were not initialized properly on cuda 8. This, again, seems to me like some kind of a bug in ptxas for cuda 8. Not sure about this problem in cuda 9, it requires some additional testing.

hfinkel added inline comments.Jul 5 2019, 6:41 PM

openmp/libomptarget/deviceRTLs/nvptx/src/state-queue.h
1	I see "if compiling as CUDA, we need the functions marked with `__device__`" as very similar to saying "if we compile into a Windows DLL we need the functions marked with `__declspec(dllexport)`". It's fine to mark it as C++ because, aside from the ABI modifiers, it's essentially "just" C++ code.

jdoerfert marked 2 inline comments as done.Jul 5 2019, 7:10 PM

jdoerfert added inline comments.

openmp/libomptarget/deviceRTLs/nvptx/src/state-queue.h
1	I mentioned this already but I will do it again: There is also no other CUDA code in this file. Especially, there is no occurrence of `__device__` in this file. If put into a context, `INLINE` could resolve to something that includes `__device__` but it could also resolve to something without `__devicde__`.
29	I actually did not notice that there is apparently no initialization of these values (or did I miss it?). If there is none, than that is the problem which needs fixing, e.g., `uint32_t Head = 0`.

ABataev added inline comments.Jul 5 2019, 7:26 PM

openmp/libomptarget/deviceRTLs/nvptx/src/state-queue.h
29	They are initialized, but without volatile qualifier they are not initialized in the compiled code.

jdoerfert marked an inline comment as done.Jul 5 2019, 7:34 PM

jdoerfert added inline comments.

openmp/libomptarget/deviceRTLs/nvptx/src/state-queue.h
29	I checked and I could not find the initialization of these variables anywhere. We should make sure the problem is not the UB of not initializing them before we require `volatile` here.

ABataev added inline comments.Jul 6 2019, 5:37 AM

openmp/libomptarget/deviceRTLs/nvptx/src/state-queue.h
29	I don't think there is an UB, we checked it some time ago. The memory is zeroinitialized, but without volatile modifier it works incorrectly at least for cuda 8. I'm trying to reduce the use of this structure with my patches, because, most probably, it relies on some undocumented features. Plus, i don't think it is effective. It implements queue to manage the resources, but I think we don't need a queue here, just a hash table.

jdoerfert marked an inline comment as done.Jul 6 2019, 10:41 AM

jdoerfert added inline comments.

openmp/libomptarget/deviceRTLs/nvptx/src/state-queue.h
29	I don't think there is an UB, we checked it some time ago. The memory is zeroinitialized, but without volatile modifier it works incorrectly at least for cuda 8. I'm confused. I was expecting CUDA to have C++ semantics which requires class members to be initialized explicitly, or am I missing something here? Plus, i don't think it is effective. It implements queue to manage the resources, but I think we don't need a queue here, just a hash table. That is a question for a follow up patch. I did not try to change the semantics in these ones, just the coding style and organization of the code.

ABataev added inline comments.Jul 6 2019, 11:46 AM

openmp/libomptarget/deviceRTLs/nvptx/src/state-queue.h
29	I'm confused. I was expecting CUDA to have C++ semantics which requires class members to be initialized explicitly, or am I missing something here? Read this https://docs.nvidia.com/cuda/archive/10.1/cuda-c-programming-guide/index.html#device-memory-specifiers

jdoerfert marked an inline comment as done.Jul 9 2019, 2:02 PM

jdoerfert added inline comments.

openmp/libomptarget/deviceRTLs/nvptx/src/state-queue.h
29	I fail to find the part that says they are initialized to zero. Could you specify where I can find that?

ABataev added inline comments.Jul 9 2019, 2:08 PM

openmp/libomptarget/deviceRTLs/nvptx/src/state-queue.h
29	There is mo such part. It is implementation feature this class relies on, as I understand. I'm trying to reduce the number of such things in the library.

jdoerfert marked an inline comment as done.Jul 9 2019, 3:33 PM

jdoerfert added inline comments.

openmp/libomptarget/deviceRTLs/nvptx/src/state-queue.h
29	I'm confused. I still stay with my earlier statement: This is currently UB and volatile somehow "makes it work". Agreed? Couldn't we simply introduce the initialization and get rid of the volatile as all accesses (that are actually there) happen atomically anyway?

ABataev added inline comments.Jul 9 2019, 4:23 PM

openmp/libomptarget/deviceRTLs/nvptx/src/state-queue.h
29	Not quite. The value is zero-initialized by default. This is not documented, this is implementation-specific. I agree, that this is better to fix. Atomic operations are not the best choice. They actually sequentialize the execution and it leads to significant performance degradation. Volatile modifier allows avoiding it. I'm not sure that this will fix the problem with incorrect optimizations in cuda 8, at least. Seems to me, this version has some internal incorrect optimizations, which can be prevented with volatile modifier. I would suggest making it cuda 8 specific, at least. When we drop support of cuda 8, we can remove this completely. I'm not sure about this problem in cuda 9, it must be checked very seriously, but as far as I know cuda 9 is much more stable than cuda 8.

jdoerfert marked an inline comment as done.Jul 9 2019, 11:48 PM

jdoerfert added inline comments.

openmp/libomptarget/deviceRTLs/nvptx/src/state-queue.h
29	Not quite. The value is zero-initialized by default. This is not documented, this is implementation-specific. How does one learn of this undocumented feature, how can one test for it, how is one save of changes to "the implementation"? Atomic operations are not the best choice. They actually sequentialize the execution That is not true as a general statement, e.g., wrt. non-queue related operations, and I'm pretty sure we want sequential consistency when it comes to modifications of the internal state of this queue. and it leads to significant performance degradation. Volatile modifier allows avoiding it. What performance degradation? Do you have performance (and correctness) results for a state-queue without atomic accesses? I'm not sure that this will fix the problem with incorrect optimizations in cuda 8, at least. Seems to me, this version has some internal incorrect optimizations, which can be prevented with volatile modifier. I would suggest making it cuda 8 specific, at least. When we drop support of cuda 8, we can remove this completely. I'm not sure about this problem in cuda 9, it must be checked very seriously, but as far as I know cuda 9 is much more stable than cuda 8. Did you test the state-queue implementation without `volatile` but with explicit initialization of the members? I would like to verify a potential problem myself. If you have a test case and information about the configuration (Cuda version, clang version, hardware, ...) I will do my best to verify your above reasoning.

ABataev added inline comments.Jul 10 2019, 6:58 AM

openmp/libomptarget/deviceRTLs/nvptx/src/state-queue.h
29	How does one learn of this undocumented feature, how can one test for it, how is one save of changes to "the implementation"? Just like I said, I also don't like it. But the atomic alternative is much worse. That is not true as a general statement, e.g., wrt. non-queue related operations, and It is true if several threads are trying to access the same memory location. This is exactly our situation. 2.I'm pretty sure we want sequential consistency when it comes to modifications of the internal state of this queue. And we do it atomically. And it leads to siginificant performance degradation. Why do you think the full runtime mode/general mode is much slower than the SPMD mode without runtime? Mostly, because of the atomic operations and lockless structures. What performance degradation? Do you have performance (and correctness) results for a state-queue without atomic accesses? Not exactly with the queue itself (though, we have an atomic loop in the queue and it makes the performance significantly worse). We had the same situation with teams reductions. We had to change the scheme, because the performance was very poor (again, atomics). You can try to seearch bugzilla, there must be an error report about poor performance of teams reductions and a fix that changes the scheme to reduce the number of atomic operations. Our problem here is that cuda does not have block/teams synchronizaion primitives. If we could have block/teams effective sync primitives we could solve this problem easily. Without them, we can use only atomic operations for synchronization and a loop. And we end up with poor performance immediately. Did you test the state-queue implementation without volatile but with explicit initialization of the members? I would like to verify a potential problem myself. If you have a test case and information about the configuration (Cuda version, clang version, hardware, ...) I will do my best to verify your above reasoning. I tried it some time ago. After some testing I decided that it would be better to try to reduce the use of this queue instead if want to get the performance. You can to try to implement direct initialization. You can try any test that requires full runtime mode (does not matter, SPMD or non-SPMD constructs) since you jst want to change the initialization.

jdoerfert marked an inline comment as done.Jul 10 2019, 4:13 PM

jdoerfert added inline comments.

openmp/libomptarget/deviceRTLs/nvptx/src/state-queue.h
29	Let us stop have these very long, very abstract conversations in code reviews that seem to lead nowhere due to the lack of actual facts. Instead, let me ask a couple of direct questions so that the comments made before become more clear and we can actually make some progress. Do you know that the memory of class members in CUDA is zero initialized or not? If so, how does one know and test this? Did you (recently) run any tests, correctness or performance, related to this queue? the use of atomic here? the use of volatile here? the missing initialization here? How did you meassurre the performance gain/loss due to atomics in non-SPMD-mode given that non-SPMD-mode is vastly different from SPMD-mode? Or, put differently, how does one know the performance difference between the modes stems from the use of atomics and not from anything else that is different, e.g., a software managed stack, the indirection through a state machine, increased register usage, ...?

ABataev added inline comments.Jul 10 2019, 5:35 PM

openmp/libomptarget/deviceRTLs/nvptx/src/state-queue.h
29	Let us stop have these very long, very abstract conversations in code reviews that seem to lead nowhere due to the lack of actual facts. Instead, let me ask a couple of direct questions so that the comments made before become more clear and we can actually make some progress. Agree. Do you know that the memory of class members in CUDA is zero initialized or not? If so, how does one know and test this? It initializes the global data with zero. Cuda is LLVM based and relies on the LLVM IR restrictions, I assume. These restrictions make it to zeroinitialize the whole global data. It was not me who decided to rely on this implementation defined behavior, I'm personally not a big fan of it. Did you (recently) run any tests, correctness or performance, related to this queue? Almost everytime before sending patch for a review or committing it. the use of atomic here? Not exactly with this structure but with the similar ones, like teams reductions buffers, memory management hash table etc. the use of volatile here? Yes, and least for cuda 8 it is required to have volatile here. Not sure about cuda 9+. the missing initialization here? Again, almost everytime when I submit a patch. How did you meassurre the performance gain/loss due to atomics in non-SPMD-mode given that non-SPMD-mode is vastly different from SPMD-mode? Or, put differently, how does one know the performance difference between the modes stems from the use of atomics and not from anything else that is different, e.g., a software managed stack, the indirection through a state machine, increased register usage, ...? I have a statistics for the test runs with the atomics and without. These statistics results shows everything. We have different tests, the ones with big number of registers usage and small number. I'm personally using nvprof and time and many runs to make the statistics more stable.

ABataev added inline comments.Jul 10 2019, 6:05 PM

openmp/libomptarget/deviceRTLs/nvptx/src/state-queue.h
29	Just in case, take a look at this http://supercomputingblog.com/cuda/cuda-tutorial-4-atomic-operations/ and this http://supercomputingblog.com/cuda/cuda-tutorial-5-performance-of-atomics/

jdoerfert marked 2 inline comments as done.Jul 10 2019, 7:06 PM

jdoerfert added inline comments.

openmp/libomptarget/deviceRTLs/nvptx/src/state-queue.h
29	It initializes the global data with zero. I fail to reproduce this in a simple test locally. I have a `__device__` global variable of struct type and the LLVM-IR declaration I get for it has an `undef` initializer, not a `zeroinitializer`. How do/did you verify that it is initialized? the missing initialization here? Again, almost everytime when I submit a patch. I'm talking about the queue implementation and all the things mentioned about it. I have a statistics for the test runs with the atomics and without. So same code, except once run with atomics once without? Is it still correct? If so, why do we have the atomics, e.g., in this queue? These statistics results shows everything. Can you share them?
29	Just in case, take a look at this http://supercomputingblog.com/cuda/cuda-tutorial-4-atomic-operations/ and this http://supercomputingblog.com/cuda/cuda-tutorial-5-performance-of-atomics/ This seems to be a micro-benchmark evaulation that stress test atomics operations. It was published ~10 years ago. I really doubt such comments help this discussion.

ABataev added inline comments.Jul 11 2019, 7:18 AM

openmp/libomptarget/deviceRTLs/nvptx/src/state-queue.h
29	This seems to be a micro-benchmark evaulation that stress test atomics operations. It was published ~10 years ago. I really doubt such comments help this discussion. There were not many changes since then, the information is still actual.
29	I fail to reproduce this in a simple test locally. I have a device global variable of struct type and the LLVM-IR declaration I get for it has an undef initializer, not a zeroinitializer. How do/did you verify that it is initialized? It is handled by the PTX-to-SASS compiler. See this https://docs.nvidia.com/cuda/archive/8.0/parallel-thread-execution/index.html#initializers. It states that "Variables in constant and global state spaces with no explicit initializer are initialized to zero by default." I'm talking about the queue implementation and all the things mentioned about it. Yes, it is an immanent part of the tests since many of them rely on the full runtime and just won't work correctly without correct runtime. So same code, except once run with atomics once without? Is it still correct? If so, why do we have the atomics, e.g., in this queue? We're talking about initialization. Other operations in the queue still require atomic operations to implement a lockless queue. Can you share them? I can provide you my statistics but I don't think it is going to be very helpful. I started it about a half year ago (when started to work on the code improvement to gain more performance in the compiler and in the runtime) and it is specific to the server I use for the work + the numbers are for cuda 8 only. Do you really need it?

Ideally, we can try to get rid of the target state queue by reducing the amount of information and using only shared memory. Without doubt, this would improve performance as Alexey mentioned and lighten the global memory usage (hundreds of MB to GB). Based on my experiments last year, I think that's doable iff we don't support nested parallelism (see thread on openmp-dev) and possibly "tweak" the spec such that we don't need to track some ICVs once the execution reaches an active parallel region (e.g., we don't need to care about nthreads-var if all nested regions are serialized, but currently the user might query the set value via omp_get_max_threads).

In D64217#1580895, @Hahnfeld wrote:

Ideally, we can try to get rid of the target state queue by reducing the amount of information and using only shared memory. Without doubt, this would improve performance as Alexey mentioned and lighten the global memory usage (hundreds of MB to GB). Based on my experiments last year, I think that's doable iff we don't support nested parallelism (see thread on openmp-dev) and possibly "tweak" the spec such that we don't need to track some ICVs once the execution reaches an active parallel region (e.g., we don't need to care about nthreads-var if all nested regions are serialized, but currently the user might query the set value via omp_get_max_threads).

I would try to use lazy initialization for the cases where we need to track some ICVs. I did not try it yet but thought about it for some time.

It is handled by the PTX-to-SASS compiler.

If they are uninitialized in LLVM-IR, it doesn't matter what happens at the PTX level and lower, LLVM could simply get rid of the accesses.
Why do you fight so hard against zero-cost initialization that *might be* needed for correctness?

Do you really need it?

Do I need/want the code and results of a correct and faster implementation? Yes, please share the results and the code so we can improve things. I don't know why you think I would not want improvements?

Ideally, we can try to get rid of the target state queue by reducing the amount of information and using only shared memory. [...]

I would try to use lazy initialization for the cases where we need to track some ICVs. I did not try it yet but thought about it for some time.

These are great ideas that you should pursue. Though this patch does actually not change anything (important) of the queue implementation and I
think it would be best to open a ticket or email thread on future development instead.

This discussion still seems to go nowhere. Let me summarize:

@ABataev has performance results for a correct and improved version of this queue. I'm looking forward to the patches!
@Hahnfeld and @ABataev discussed options to eliminate the need for a statue queue all together. I suggest we start a separate discussion on this as it is obviously more involved.
I will add initialization to the uninitialized class members.
I am not aware of any test case / system configuration under which this patch causes problems. If it does, we split it in two, one to do the refactoring, one to remove volatile.
@ABataev mentioned removing volatile could cause problems. I'm hoping for details on the system (Cuda, LLVM, Clang, version numbers) and maybe a reproducer for which it does.

If they are uninitialized in LLVM-IR, it doesn't matter what happens at the PTX level and lower, LLVM could simply get rid of the accesses.

They are initialized in LLVM IR, by zeroinitializer, they are not initialized in PTX file explicitly. But it is ok for PTX since, according to the documents, the memory implicitly initialized by zeros.

Why do you fight so hard against zero-cost initialization that *might be* needed for correctness?

I'm not fighting against zero-cost initialization. Just it is not possible to implement zero-cost initialization.

Do I need/want the code and results of a correct and faster implementation? Yes, please share the results and the code so we can improve things. I don't know why you think I would not want improvements?

The code is not implemented in full yet. I'm doing it step by step. As soon as I commit the patches with threadid fixes etc. I will continue working on code and performance improvement.
I don't have full implementation at the moment as I'm busy with other things, like bug fixing.
The results are just the times of our internal test suite.
Here they are:

real 315.65
user 63.56
sys 104.82

real 310.98
user 62.94
sys 105.07

real 310.91
user 54.78
sys 104.05

real 301.62
user 59.57
sys 105.11

It is the execution results with my previous patches for NVPTX runtime. The last patch I sent for the review gives something about 299 secs real time.

I am not aware of any test case / system configuration under which this patch causes problems. If it does, we split it in two, one to do the refactoring, one to remove volatile.

I'm telling you that volatile stuff is required at least for CUDA 8. Make it a conditional, so the volatile modifier could be applied to cuda 8 at least. Any test with full runtime reveals the problem.
If we're going to drop the support for cuda 8 it is a completely different story. In this case, I think, we can try to remove all the volatile functionality since, seems to me, the problems were fixed in cuda9+. But I'm not sure about this, we need to check this very carefully. I don't want to end up with non-functional runtime that works only from time to time, in some particular systems and only with a limited set of code.

In D64217#1591812, @ABataev wrote:

If they are uninitialized in LLVM-IR, it doesn't matter what happens at the PTX level and lower, LLVM could simply get rid of the accesses.

They are initialized in LLVM IR, by zeroinitializer, they are not initialized in PTX file explicitly. But it is ok for PTX since, according to the documents, the memory implicitly initialized by zeros.

Why do you fight so hard against zero-cost initialization that *might be* needed for correctness?

I'm not fighting against zero-cost initialization. Just it is not possible to implement zero-cost initialization.

There are only two possibilities:

It is zero initialized, which you claim but which I failed to validate so far.
It is not zero initialized.

In case 1, initializing with zero is free as it is already happening. We cannot have two zero initializers after all.
In case 2, initializing with zero is *necessary* as it can otherwise be transformed by LLVM in a way that would break the behavior we need.

It is zero initialized, which you claim but which I failed to validate so far.

To be clear, do you mean that you've not yet checked, or that you tried but did not observe the zeroinitializer?

In D64217#1596480, @jdoerfert wrote:

In D64217#1591812, @ABataev wrote:

If they are uninitialized in LLVM-IR, it doesn't matter what happens at the PTX level and lower, LLVM could simply get rid of the accesses.

They are initialized in LLVM IR, by zeroinitializer, they are not initialized in PTX file explicitly. But it is ok for PTX since, according to the documents, the memory implicitly initialized by zeros.

Why do you fight so hard against zero-cost initialization that *might be* needed for correctness?

I'm not fighting against zero-cost initialization. Just it is not possible to implement zero-cost initialization.

There are only two possibilities:

It is zero initialized, which you claim but which I failed to validate so far.

It is not zero initialized.

In case 1, initializing with zero is free as it is already happening. We cannot have two zero initializers after all.
In case 2, initializing with zero is *necessary* as it can otherwise be transformed by LLVM in a way that would break the behavior we need.

Short answer - it is zero initialized.
Long answer.
In LLVM IR form it is emitted with zeroinitializer. When this LLVM IR is compiled to ptx, this zeroinitializer is dropped. But it is ok, since, according to PTX format, the global variables are zeroinitialized by default, if they don't have intializers.

In D64217#1596541, @ABataev wrote:

In D64217#1596480, @jdoerfert wrote:

In D64217#1591812, @ABataev wrote:

If they are uninitialized in LLVM-IR, it doesn't matter what happens at the PTX level and lower, LLVM could simply get rid of the accesses.

They are initialized in LLVM IR, by zeroinitializer, they are not initialized in PTX file explicitly. But it is ok for PTX since, according to the documents, the memory implicitly initialized by zeros.

Why do you fight so hard against zero-cost initialization that *might be* needed for correctness?

I'm not fighting against zero-cost initialization. Just it is not possible to implement zero-cost initialization.

There are only two possibilities:

It is zero initialized, which you claim but which I failed to validate so far.

It is not zero initialized.

In case 1, initializing with zero is free as it is already happening. We cannot have two zero initializers after all.
In case 2, initializing with zero is *necessary* as it can otherwise be transformed by LLVM in a way that would break the behavior we need.

Short answer - it is zero initialized.
Long answer.
In LLVM IR form it is emitted with zeroinitializer. When this LLVM IR is compiled to ptx, this zeroinitializer is dropped. But it is ok, since, according to PTX format, the global variables are zeroinitialized by default, if they don't have intializers.

So why exactly do you object to adding explicit zero initialization when it would not change a thing? You know, just in case the behavior changes. Quote you:

The value is zero-initialized by default. This is not documented, this is implementation-specific. I agree, that this is better to fix.

In D64217#1596819, @jdoerfert wrote:

In D64217#1596541, @ABataev wrote:

In D64217#1596480, @jdoerfert wrote:

In D64217#1591812, @ABataev wrote:

If they are uninitialized in LLVM-IR, it doesn't matter what happens at the PTX level and lower, LLVM could simply get rid of the accesses.

They are initialized in LLVM IR, by zeroinitializer, they are not initialized in PTX file explicitly. But it is ok for PTX since, according to the documents, the memory implicitly initialized by zeros.

Why do you fight so hard against zero-cost initialization that *might be* needed for correctness?

I'm not fighting against zero-cost initialization. Just it is not possible to implement zero-cost initialization.

There are only two possibilities:

It is zero initialized, which you claim but which I failed to validate so far.

It is not zero initialized.

In case 1, initializing with zero is free as it is already happening. We cannot have two zero initializers after all.
In case 2, initializing with zero is *necessary* as it can otherwise be transformed by LLVM in a way that would break the behavior we need.

Short answer - it is zero initialized.
Long answer.
In LLVM IR form it is emitted with zeroinitializer. When this LLVM IR is compiled to ptx, this zeroinitializer is dropped. But it is ok, since, according to PTX format, the global variables are zeroinitialized by default, if they don't have intializers.

So why exactly do you object to adding explicit zero initialization when it would not change a thing? You know, just in case the behavior changes. Quote you:

The value is zero-initialized by default. This is not documented, this is implementation-specific. I agree, that this is better to fix.

It is not documented in Cuda itself, but documented in PTX format specification.

Plus, how are you going to initialize fields? Provide a full constructor? It is prohibited for globals in Cuda. All other solutions require use of atomics and performance degradation.

JonChesterfield added a subscriber: JonChesterfield.Jul 24 2019, 7:56 AM

JonChesterfield added inline comments.Jul 24 2019, 8:03 AM

openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.h
14	This is interesting. I've just been looking at moving the functions that use inline asm into a target specific region to help code reuse for non-nvptx architectures.

JonChesterfield added inline comments.Aug 6 2019, 9:11 AM

openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.h
19	As it happens, atomicExch works fine for amdgcn. I'm not sure all the atomics are supported. Closely related though, atomicMax doesn't appear to be available for CUDA_ARCH < 350, so on nvptx there's a function in loop.cu (__kmpc_reduce_conditional_lastprivate) which open codes a max in terms of atomicCAS. Given some variation in capability between cuda revisions, a compatibility shim that implements at least atomicMax seems a good idea. I'm not sure writing a compatibility layer for all of the possible atomics is within scope - such a thing may already exist elsewhere.

ABataev added inline comments.Aug 6 2019, 9:13 AM

openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.h
19	It works for AMD but may not work for some other platforms. Better to make it target-dependent anyway.

grokos added inline comments.Aug 6 2019, 9:44 AM

openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.h
19	Closely related though, atomicMax doesn't appear to be available for CUDA_ARCH < 350, so on nvptx there's a function in loop.cu (__kmpc_reduce_conditional_lastprivate) which open codes a max in terms of atomicCAS. Just a correction, 64-bit `atomicMax` is not available for CUDA_ARCH < 350, the 32-bit version is. There has been an attempt to emulate the 64-bit `atomicMax` here: https://reviews.llvm.org/D46185 (although that patch was wrong and the author hasn't responded in well over a year).

JonChesterfield added inline comments.Aug 6 2019, 10:32 AM

openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.h
19	Ah, nice. Thanks for the detail. Curiously there doesn't appear to be a signed 64 bit atomicMax in cuda (according to https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html) Rocm's tree contains that (revised) patch to nvptx, presumably for whatever sm_30 device we have under test. I was under the impression that our nvptx tree was identical to upstream but actually there's some divergence for me to patch up.

JonChesterfield mentioned this in D65836: Factor architecture dependent code out of loop.cu.Aug 6 2019, 5:14 PM

JonChesterfield mentioned this in rL368751: Factor architecture dependent code out of loop.cu.Aug 13 2019, 2:42 PM

JonChesterfield mentioned this in rGed3324f6b6e3: Factor architecture dependent code out of loop.cu.

jdoerfert mentioned this in D71404: [libomptarget][nfc] Introduce atomic wrapper function.Dec 12 2019, 5:03 PM

@JonChesterfield @atmnpatel @tianshilei1992 Unclear if this is still needed and/or applies. Feel free to take a look.

Herald added subscribers: sstefan1, yaxunl. · View Herald TranscriptDec 15 2020, 10:08 AM

target_impl.h is fairly extensive now. There's some debt remaining in how the atomics are handled but it's not causing much harm.

I suspect, but have not proven, that getting rid of volatile qualifiers causes problems for nvptx. Nvidia's atomic model is volatile + fences, which isn't brilliantly compatible with llvm's atomic model. I don't have complete faith in the ptx backend successfully translating atomic semantics into code that ptxas does the right thing with. I'm therefore nervous about changing away from volatile qualifying everything.

The state queue has some limitations. @ronl and @pdhaliwal have spent more looking at it than I have - iirc it reads out of bounds for stack frames above a certain size without diagnostics. The array indexed by smid() doesn't load balance as well for amdgcn as it does for nvptx.

My preference is to delete the state queue entirely. I think it is only used for nested parallelism, which is very slow on gpus whatever we do with it, but there's some semantic problem with just ignoring the nested pragmas. That probably means we can replace the linked stack frame allocated from this state_queue with a compiler transform.

Revision Contents

Path

Size

openmp/

libomptarget/

deviceRTLs/

nvptx/

src/

2 lines

2 lines

116 lines

29 lines

Diff 208070

openmp/libomptarget/deviceRTLs/nvptx/src/omp_data.cu

	Show All 17 Lines

	__device__ omptarget_device_environmentTy omptarget_device_environment;			__device__ omptarget_device_environmentTy omptarget_device_environment;

	////////////////////////////////////////////////////////////////////////////////			////////////////////////////////////////////////////////////////////////////////
	// global data holding OpenMP state information			// global data holding OpenMP state information
	////////////////////////////////////////////////////////////////////////////////			////////////////////////////////////////////////////////////////////////////////

	__device__			__device__
	omptarget_nvptx_Queue<omptarget_nvptx_ThreadPrivateContext, OMP_STATE_COUNT>			omptarget_state_queue<omptarget_nvptx_ThreadPrivateContext, OMP_STATE_COUNT>
	omptarget_nvptx_device_State[MAX_SM];			omptarget_nvptx_device_State[MAX_SM];

	__device__ omptarget_nvptx_SimpleMemoryManager			__device__ omptarget_nvptx_SimpleMemoryManager
	omptarget_nvptx_simpleMemoryManager;			omptarget_nvptx_simpleMemoryManager;
	__device__ __shared__ uint32_t usedMemIdx;			__device__ __shared__ uint32_t usedMemIdx;
	__device__ __shared__ uint32_t usedSlotIdx;			__device__ __shared__ uint32_t usedSlotIdx;

	__device__ __shared__ uint8_t parallelLevel[MAX_THREADS_PER_TEAM / WARPSIZE];			__device__ __shared__ uint8_t parallelLevel[MAX_THREADS_PER_TEAM / WARPSIZE];
	Show All 33 Lines

openmp/libomptarget/deviceRTLs/nvptx/src/omptarget-nvptx.cu

	Show All 11 Lines

	#include "omptarget-nvptx.h"			#include "omptarget-nvptx.h"

	////////////////////////////////////////////////////////////////////////////////			////////////////////////////////////////////////////////////////////////////////
	// global data tables			// global data tables
	////////////////////////////////////////////////////////////////////////////////			////////////////////////////////////////////////////////////////////////////////

	extern __device__			extern __device__
	omptarget_nvptx_Queue<omptarget_nvptx_ThreadPrivateContext, OMP_STATE_COUNT>			omptarget_state_queue<omptarget_nvptx_ThreadPrivateContext, OMP_STATE_COUNT>
	omptarget_nvptx_device_State[MAX_SM];			omptarget_nvptx_device_State[MAX_SM];

	////////////////////////////////////////////////////////////////////////////////			////////////////////////////////////////////////////////////////////////////////
	// init entry points			// init entry points
	////////////////////////////////////////////////////////////////////////////////			////////////////////////////////////////////////////////////////////////////////

	INLINE static unsigned smid() {			INLINE static unsigned smid() {
	unsigned id;			unsigned id;
	▲ Show 20 Lines • Show All 158 Lines • Show Last 20 Lines

openmp/libomptarget/deviceRTLs/nvptx/src/state-queue.h

	//===--------- statequeue.h - NVPTX OpenMP GPU State Queue ------- CUDA -*-===//			//===--- state-queue.h --- OpenMP target state queue ------------- C++ --===//
				ABataevUnsubmitted Done Reply Inline Actions I think it is better to mark it as Cuda, not C/C++, since it contains some Cuda specific constructs. ABataev: I think it is better to mark it as Cuda, not C/C++, since it contains some Cuda specific…
				jdoerfertAuthorUnsubmitted Done Reply Inline Actions I though I removed all of them, if I haven't we should. The idea is that this "core logic" code should _not_ be CUDA code but something else, e.g., C++ seems natural. jdoerfert: I though I removed all of them, if I haven't we should. The idea is that this "core logic" code…
				ABataevUnsubmitted Done Reply Inline Actions Nope, `INLINE` macro is expanded into cuda specific attributes. ABataev: Nope, `INLINE` macro is expanded into cuda specific attributes.
				jdoerfertAuthorUnsubmitted Done Reply Inline Actions It is a macro and macros are not specific to CUDA. To what it expands if this file is included from the CUDA code right now is not relevant. This is a first step and we will have to move these files into the common folder (see the RFC). The macro definition will be in the device specific parts and they can then use whatever device (language) specific extensions they want without making these part specific to that device (language) extension. That is, after all, the main motivation behind these patches. jdoerfert: It is a macro and macros are not specific to CUDA. To what it expands if this file is included…
				ABataevUnsubmitted Not Done Reply Inline Actions But anyway, you need to compile it in cuda or hip mode, not pure c++. I think it is better for tbe developer to understand from the very beginning that this code requires cuda or hip compiler. ABataev: But anyway, you need to compile it in cuda or hip mode, not pure c++. I think it is better for…
				jdoerfertAuthorUnsubmitted Done Reply Inline Actions Given that this is "a header file", there is really no telling if "it is" C/C++/Cuda/Hip/OpenCL/Sycl/... All it should require is Clang (potentially in some mode). Given the new OpenMP standard, the mode could reasonably be "-fpoenmp -fopenmp-targets=...". jdoerfert: Given that this is "a header file", there is really no telling if "it is"…
				ABataevUnsubmitted Not Done Reply Inline Actions Not necessarily.currently, it can be compiled with nvcc too. It does not produce bc file, but can be linked and works. So, clang is not necessarily a requirement here. What I'm saying that it is better to inform the developer that this code requires cuda or hip compiler, not pure c++ compiler. ABataev: Not necessarily.currently, it can be compiled with nvcc too. It does not produce bc file, but…
				jdoerfertAuthorUnsubmitted Done Reply Inline Actions It does not require cuda or hip. jdoerfert: It does not require cuda or hip.
				hfinkelUnsubmitted Not Done Reply Inline Actions I think that it's useful to think about this in the following way: We often write code that requires, at least given some configuration of the preprocessor, various extensions to standard C++. CUDA and HIP are C++ extensions. Code using __builtin_frame_address or `__attribute__((always_inline))` or vector intrinsics is code using C++ extensions. If the code can only be compiled with CUDA extensions enabled, then by all means, using .cu is appropriate. If it can be compiled without use of C++ extensions, or with a wide variety of C++ extensions, given different configurations of the preprocessor, then just calling it a "C++" source file seems reasonable. hfinkel: I think that it's useful to think about this in the following way: We often write code that…
				ABataevUnsubmitted Not Done Reply Inline Actions Hal, I agree, but this code cannot be compiled without cuda correctly. You need to generate the device code for these functions and because of this they must be marked as `__device__` functions. This is cuda specific modifier that tells the compiler that we don't need the host version of this code, just the device part. But if you think that this is not important, ok, mark it as c++. ABataev: Hal, I agree, but this code cannot be compiled without cuda correctly. You need to generate…
				hfinkelUnsubmitted Not Done Reply Inline Actions I see "if compiling as CUDA, we need the functions marked with `__device__`" as very similar to saying "if we compile into a Windows DLL we need the functions marked with `__declspec(dllexport)`". It's fine to mark it as C++ because, aside from the ABI modifiers, it's essentially "just" C++ code. hfinkel: I see "if compiling as CUDA, we need the functions marked with `__device__`" as very similar…
				jdoerfertAuthorUnsubmitted Done Reply Inline Actions I mentioned this already but I will do it again: There is also no other CUDA code in this file. Especially, there is no occurrence of `__device__` in this file. If put into a context, `INLINE` could resolve to something that includes `__device__` but it could also resolve to something without `__devicde__`. jdoerfert: I mentioned this already but I will do it again: There is also no other CUDA code in this file.
	//			//
	// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.			// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
	// See https://llvm.org/LICENSE.txt for license information.			// See https://llvm.org/LICENSE.txt for license information.
	// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception			// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
	//			//
	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//
	//			//
	// This file contains a queue to hand out OpenMP state objects to teams of			// This file provides a queue to hand out OpenMP state objects to teams of one
	// one or more kernels.			// or more kernels.
	//			//
	// Reference:			// Reference:
	// Thomas R.W. Scogland and Wu-chun Feng. 2015.			// Thomas R.W. Scogland and Wu-chun Feng. 2015.
	// Design and Evaluation of Scalable Concurrent Queues for Many-Core			// Design and Evaluation of Scalable Concurrent Queues for Many-Core
	// Architectures. International Conference on Performance Engineering.			// Architectures. International Conference on Performance Engineering.
	//			//
	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//

	#ifndef __STATE_QUEUE_H			#ifndef STATE_QUEUE_H
	#define __STATE_QUEUE_H			#define STATE_QUEUE_H

	#include <stdint.h>			#include "target_impl.h"

	#include "option.h" // choices we have			#include <stdint.h>

	template <typename ElementType, uint32_t SIZE> class omptarget_nvptx_Queue {			template <typename ElementType, uint32_t SIZE> class omptarget_state_queue {
				ABataevUnsubmitted Not Done Reply Inline Actions Maybe it is better to name it in LLVM style, without underscores etc.? ABataev: Maybe it is better to name it in LLVM style, without underscores etc.?
				jdoerfertAuthorUnsubmitted Done Reply Inline Actions I like the idea for for internal types. It will make it more consistent and the external interface also easier to spot. jdoerfert: I like the idea for for internal types. It will make it more consistent and the external…
	private:			ElementType Elements[SIZE];
	ElementType elements[SIZE];			ElementType *ElementQueue[SIZE];
	volatile ElementType *elementQueue[SIZE];			uint32_t Head;
	ABataevUnsubmitted Not Done Reply Inline Actions I would like to keep volatile modifier at least for cuda 8. Without volatile it may produce incorrect code. I would keep it until we drop support for cuda 8. Plus, I would suggest to test it very seriously for other versions of cuda. Does it really works correctly? Ptxas may use some incorrect optimizations without volatile. Though I like the idea of removing them. ABataev: I would like to keep volatile modifier at least for cuda 8. Without volatile it may produce…
	jdoerfertAuthorUnsubmitted Done Reply Inline Actions These members are accessed only through atomic accesses. Why do we would require volatile in addition? jdoerfert: These members are accessed only through atomic accesses. Why do we would require volatile in…
	ABataevUnsubmitted Not Done Reply Inline Actions Without `volatile` they were not initialized properly on cuda 8. This, again, seems to me like some kind of a bug in ptxas for cuda 8. Not sure about this problem in cuda 9, it requires some additional testing. ABataev: Without `volatile` they were not initialized properly on cuda 8. This, again, seems to me like…
	jdoerfertAuthorUnsubmitted Done Reply Inline Actions I actually did not notice that there is apparently no initialization of these values (or did I miss it?). If there is none, than that is the problem which needs fixing, e.g., `uint32_t Head = 0`. jdoerfert: I actually did not notice that there is apparently no initialization of these values (or did I…
	ABataevUnsubmitted Not Done Reply Inline Actions They are initialized, but without volatile qualifier they are not initialized in the compiled code. ABataev: They are initialized, but without volatile qualifier they are not initialized in the compiled…
	jdoerfertAuthorUnsubmitted Done Reply Inline Actions I checked and I could not find the initialization of these variables anywhere. We should make sure the problem is not the UB of not initializing them before we require `volatile` here. jdoerfert: I checked and I could not find the initialization of these variables anywhere. We should make…
	ABataevUnsubmitted Not Done Reply Inline Actions I don't think there is an UB, we checked it some time ago. The memory is zeroinitialized, but without volatile modifier it works incorrectly at least for cuda 8. I'm trying to reduce the use of this structure with my patches, because, most probably, it relies on some undocumented features. Plus, i don't think it is effective. It implements queue to manage the resources, but I think we don't need a queue here, just a hash table. ABataev: I don't think there is an UB, we checked it some time ago. The memory is zeroinitialized, but…
	jdoerfertAuthorUnsubmitted Done Reply Inline Actions I don't think there is an UB, we checked it some time ago. The memory is zeroinitialized, but without volatile modifier it works incorrectly at least for cuda 8. I'm confused. I was expecting CUDA to have C++ semantics which requires class members to be initialized explicitly, or am I missing something here? Plus, i don't think it is effective. It implements queue to manage the resources, but I think we don't need a queue here, just a hash table. That is a question for a follow up patch. I did not try to change the semantics in these ones, just the coding style and organization of the code. jdoerfert: > I don't think there is an UB, we checked it some time ago. The memory is zeroinitialized, but…
	ABataevUnsubmitted Not Done Reply Inline Actions I'm confused. I was expecting CUDA to have C++ semantics which requires class members to be initialized explicitly, or am I missing something here? Read this https://docs.nvidia.com/cuda/archive/10.1/cuda-c-programming-guide/index.html#device-memory-specifiers ABataev: > I'm confused. I was expecting CUDA to have C++ semantics which requires class members to be…
	jdoerfertAuthorUnsubmitted Done Reply Inline Actions I fail to find the part that says they are initialized to zero. Could you specify where I can find that? jdoerfert: I fail to find the part that says they are initialized to zero. Could you specify where I can…
	ABataevUnsubmitted Not Done Reply Inline Actions There is mo such part. It is implementation feature this class relies on, as I understand. I'm trying to reduce the number of such things in the library. ABataev: There is mo such part. It is implementation feature this class relies on, as I understand. I'm…
	jdoerfertAuthorUnsubmitted Done Reply Inline Actions I'm confused. I still stay with my earlier statement: This is currently UB and volatile somehow "makes it work". Agreed? Couldn't we simply introduce the initialization and get rid of the volatile as all accesses (that are actually there) happen atomically anyway? jdoerfert: I'm confused. I still stay with my earlier statement: This is currently UB and volatile somehow…
	ABataevUnsubmitted Not Done Reply Inline Actions Not quite. The value is zero-initialized by default. This is not documented, this is implementation-specific. I agree, that this is better to fix. Atomic operations are not the best choice. They actually sequentialize the execution and it leads to significant performance degradation. Volatile modifier allows avoiding it. I'm not sure that this will fix the problem with incorrect optimizations in cuda 8, at least. Seems to me, this version has some internal incorrect optimizations, which can be prevented with volatile modifier. I would suggest making it cuda 8 specific, at least. When we drop support of cuda 8, we can remove this completely. I'm not sure about this problem in cuda 9, it must be checked very seriously, but as far as I know cuda 9 is much more stable than cuda 8. ABataev: Not quite. The value is zero-initialized by default. This is not documented, this is…
	jdoerfertAuthorUnsubmitted Done Reply Inline Actions Not quite. The value is zero-initialized by default. This is not documented, this is implementation-specific. How does one learn of this undocumented feature, how can one test for it, how is one save of changes to "the implementation"? Atomic operations are not the best choice. They actually sequentialize the execution That is not true as a general statement, e.g., wrt. non-queue related operations, and I'm pretty sure we want sequential consistency when it comes to modifications of the internal state of this queue. and it leads to significant performance degradation. Volatile modifier allows avoiding it. What performance degradation? Do you have performance (and correctness) results for a state-queue without atomic accesses? I'm not sure that this will fix the problem with incorrect optimizations in cuda 8, at least. Seems to me, this version has some internal incorrect optimizations, which can be prevented with volatile modifier. I would suggest making it cuda 8 specific, at least. When we drop support of cuda 8, we can remove this completely. I'm not sure about this problem in cuda 9, it must be checked very seriously, but as far as I know cuda 9 is much more stable than cuda 8. Did you test the state-queue implementation without `volatile` but with explicit initialization of the members? I would like to verify a potential problem myself. If you have a test case and information about the configuration (Cuda version, clang version, hardware, ...) I will do my best to verify your above reasoning. jdoerfert: > Not quite. The value is zero-initialized by default. This is not documented, this is…
	ABataevUnsubmitted Not Done Reply Inline Actions How does one learn of this undocumented feature, how can one test for it, how is one save of changes to "the implementation"? Just like I said, I also don't like it. But the atomic alternative is much worse. That is not true as a general statement, e.g., wrt. non-queue related operations, and It is true if several threads are trying to access the same memory location. This is exactly our situation. 2.I'm pretty sure we want sequential consistency when it comes to modifications of the internal state of this queue. And we do it atomically. And it leads to siginificant performance degradation. Why do you think the full runtime mode/general mode is much slower than the SPMD mode without runtime? Mostly, because of the atomic operations and lockless structures. What performance degradation? Do you have performance (and correctness) results for a state-queue without atomic accesses? Not exactly with the queue itself (though, we have an atomic loop in the queue and it makes the performance significantly worse). We had the same situation with teams reductions. We had to change the scheme, because the performance was very poor (again, atomics). You can try to seearch bugzilla, there must be an error report about poor performance of teams reductions and a fix that changes the scheme to reduce the number of atomic operations. Our problem here is that cuda does not have block/teams synchronizaion primitives. If we could have block/teams effective sync primitives we could solve this problem easily. Without them, we can use only atomic operations for synchronization and a loop. And we end up with poor performance immediately. Did you test the state-queue implementation without volatile but with explicit initialization of the members? I would like to verify a potential problem myself. If you have a test case and information about the configuration (Cuda version, clang version, hardware, ...) I will do my best to verify your above reasoning. I tried it some time ago. After some testing I decided that it would be better to try to reduce the use of this queue instead if want to get the performance. You can to try to implement direct initialization. You can try any test that requires full runtime mode (does not matter, SPMD or non-SPMD constructs) since you jst want to change the initialization. ABataev: > How does one learn of this undocumented feature, how can one test for it, how is one save of…
	jdoerfertAuthorUnsubmitted Done Reply Inline Actions Let us stop have these very long, very abstract conversations in code reviews that seem to lead nowhere due to the lack of actual facts. Instead, let me ask a couple of direct questions so that the comments made before become more clear and we can actually make some progress. Do you know that the memory of class members in CUDA is zero initialized or not? If so, how does one know and test this? Did you (recently) run any tests, correctness or performance, related to this queue? the use of atomic here? the use of volatile here? the missing initialization here? How did you meassurre the performance gain/loss due to atomics in non-SPMD-mode given that non-SPMD-mode is vastly different from SPMD-mode? Or, put differently, how does one know the performance difference between the modes stems from the use of atomics and not from anything else that is different, e.g., a software managed stack, the indirection through a state machine, increased register usage, ...? jdoerfert: Let us stop have these very long, very abstract conversations in code reviews that seem to lead…
	ABataevUnsubmitted Not Done Reply Inline Actions Let us stop have these very long, very abstract conversations in code reviews that seem to lead nowhere due to the lack of actual facts. Instead, let me ask a couple of direct questions so that the comments made before become more clear and we can actually make some progress. Agree. Do you know that the memory of class members in CUDA is zero initialized or not? If so, how does one know and test this? It initializes the global data with zero. Cuda is LLVM based and relies on the LLVM IR restrictions, I assume. These restrictions make it to zeroinitialize the whole global data. It was not me who decided to rely on this implementation defined behavior, I'm personally not a big fan of it. Did you (recently) run any tests, correctness or performance, related to this queue? Almost everytime before sending patch for a review or committing it. the use of atomic here? Not exactly with this structure but with the similar ones, like teams reductions buffers, memory management hash table etc. the use of volatile here? Yes, and least for cuda 8 it is required to have volatile here. Not sure about cuda 9+. the missing initialization here? Again, almost everytime when I submit a patch. How did you meassurre the performance gain/loss due to atomics in non-SPMD-mode given that non-SPMD-mode is vastly different from SPMD-mode? Or, put differently, how does one know the performance difference between the modes stems from the use of atomics and not from anything else that is different, e.g., a software managed stack, the indirection through a state machine, increased register usage, ...? I have a statistics for the test runs with the atomics and without. These statistics results shows everything. We have different tests, the ones with big number of registers usage and small number. I'm personally using nvprof and time and many runs to make the statistics more stable. ABataev: > Let us stop have these very long, very abstract conversations in code reviews that seem to…
	jdoerfertAuthorUnsubmitted Done Reply Inline Actions It initializes the global data with zero. I fail to reproduce this in a simple test locally. I have a `__device__` global variable of struct type and the LLVM-IR declaration I get for it has an `undef` initializer, not a `zeroinitializer`. How do/did you verify that it is initialized? the missing initialization here? Again, almost everytime when I submit a patch. I'm talking about the queue implementation and all the things mentioned about it. I have a statistics for the test runs with the atomics and without. So same code, except once run with atomics once without? Is it still correct? If so, why do we have the atomics, e.g., in this queue? These statistics results shows everything. Can you share them? jdoerfert: > It initializes the global data with zero. I fail to reproduce this in a simple test locally.
	ABataevUnsubmitted Not Done Reply Inline Actions I fail to reproduce this in a simple test locally. I have a device global variable of struct type and the LLVM-IR declaration I get for it has an undef initializer, not a zeroinitializer. How do/did you verify that it is initialized? It is handled by the PTX-to-SASS compiler. See this https://docs.nvidia.com/cuda/archive/8.0/parallel-thread-execution/index.html#initializers. It states that "Variables in constant and global state spaces with no explicit initializer are initialized to zero by default." I'm talking about the queue implementation and all the things mentioned about it. Yes, it is an immanent part of the tests since many of them rely on the full runtime and just won't work correctly without correct runtime. So same code, except once run with atomics once without? Is it still correct? If so, why do we have the atomics, e.g., in this queue? We're talking about initialization. Other operations in the queue still require atomic operations to implement a lockless queue. Can you share them? I can provide you my statistics but I don't think it is going to be very helpful. I started it about a half year ago (when started to work on the code improvement to gain more performance in the compiler and in the runtime) and it is specific to the server I use for the work + the numbers are for cuda 8 only. Do you really need it? ABataev: > I fail to reproduce this in a simple test locally. I have a __device__ global variable of…
	ABataevUnsubmitted Not Done Reply Inline Actions Just in case, take a look at this http://supercomputingblog.com/cuda/cuda-tutorial-4-atomic-operations/ and this http://supercomputingblog.com/cuda/cuda-tutorial-5-performance-of-atomics/ ABataev: Just in case, take a look at this http://supercomputingblog.com/cuda/cuda-tutorial-4-atomic…
	jdoerfertAuthorUnsubmitted Done Reply Inline Actions Just in case, take a look at this http://supercomputingblog.com/cuda/cuda-tutorial-4-atomic-operations/ and this http://supercomputingblog.com/cuda/cuda-tutorial-5-performance-of-atomics/ This seems to be a micro-benchmark evaulation that stress test atomics operations. It was published ~10 years ago. I really doubt such comments help this discussion. jdoerfert: > Just in case, take a look at this http://supercomputingblog.com/cuda/cuda-tutorial-4-atomic…
	ABataevUnsubmitted Not Done Reply Inline Actions This seems to be a micro-benchmark evaulation that stress test atomics operations. It was published ~10 years ago. I really doubt such comments help this discussion. There were not many changes since then, the information is still actual. ABataev: > This seems to be a micro-benchmark evaulation that stress test atomics operations. It was…
	volatile uint32_t head;			uint32_t Tail;
	volatile uint32_t ids[SIZE];			uint32_t Ids[SIZE];
	volatile uint32_t tail;

	static const uint32_t MAX_ID = (1u << 31) / SIZE / 2;			static const uint32_t MAX_ID = (1u << 31) / SIZE / 2;
	INLINE uint32_t ENQUEUE_TICKET();			INLINE uint32_t enqueueTicket();
	INLINE uint32_t DEQUEUE_TICKET();			INLINE uint32_t dequeueTicket();
	INLINE static uint32_t ID(uint32_t ticket);			INLINE static uint32_t getID(uint32_t Ticket);
	INLINE bool IsServing(uint32_t slot, uint32_t id);			INLINE bool isServing(uint32_t Slot, uint32_t ID);
	INLINE void PushElement(uint32_t slot, ElementType *element);			INLINE void pushElement(uint32_t Slot, ElementType *element);
	INLINE ElementType *PopElement(uint32_t slot);			INLINE ElementType *popElement(uint32_t Slot);
	INLINE void DoneServing(uint32_t slot, uint32_t id);			INLINE void doneServing(uint32_t Slot, uint32_t ID);

	public:			public:
	INLINE omptarget_nvptx_Queue() {}			INLINE void enqueue(ElementType *element);
	INLINE void Enqueue(ElementType *element);			INLINE ElementType *dequeue();
	INLINE ElementType *Dequeue();
	};			};

	#include "state-queuei.h"			template <typename ElementType, uint32_t SIZE>
				INLINE uint32_t omptarget_state_queue<ElementType, SIZE>::enqueueTicket() {
				return __kmpc_impl_atomic_add(&Tail, 1);
				}

				template <typename ElementType, uint32_t SIZE>
				INLINE uint32_t omptarget_state_queue<ElementType, SIZE>::dequeueTicket() {
				return __kmpc_impl_atomic_add(&Head, 1);
				}

				template <typename ElementType, uint32_t SIZE>
				INLINE uint32_t
				omptarget_state_queue<ElementType, SIZE>::getID(uint32_t Ticket) {
				return (Ticket / SIZE) * 2;
				}

				template <typename ElementType, uint32_t SIZE>
				INLINE bool omptarget_state_queue<ElementType, SIZE>::isServing(uint32_t Slot,
				uint32_t ID) {
				return __kmpc_impl_atomic_add(&Ids[Slot], 0) == ID;
				}

				template <typename ElementType, uint32_t SIZE>
				INLINE void
				omptarget_state_queue<ElementType, SIZE>::pushElement(uint32_t Slot,
				ElementType *element) {
				__kmpc_impl_atomic_exchange(&ElementQueue[Slot], element);
				}

				template <typename ElementType, uint32_t SIZE>
				INLINE ElementType *
				omptarget_state_queue<ElementType, SIZE>::popElement(uint32_t Slot) {
				return (ElementType *)__kmpc_impl_atomic_add(&ElementQueue[Slot], 0);
				}

				template <typename ElementType, uint32_t SIZE>
				INLINE void omptarget_state_queue<ElementType, SIZE>::doneServing(uint32_t Slot,
				uint32_t ID) {
				__kmpc_impl_atomic_exchange(&Ids[Slot], (ID + 1) % MAX_ID);
				}

				template <typename ElementType, uint32_t SIZE>
				INLINE void
				omptarget_state_queue<ElementType, SIZE>::enqueue(ElementType *element) {
				uint32_t Ticket = enqueueTicket();
				uint32_t Slot = Ticket % SIZE;
				uint32_t ID = getID(Ticket) + 1;
				while (!isServing(Slot, ID))
				;
				pushElement(Slot, element);
				doneServing(Slot, ID);
				}

				template <typename ElementType, uint32_t SIZE>
				INLINE ElementType *omptarget_state_queue<ElementType, SIZE>::dequeue() {
				uint32_t Ticket = dequeueTicket();
				uint32_t Slot = Ticket % SIZE;
				uint32_t ID = getID(Ticket);
				while (!isServing(Slot, ID))
				;
				ElementType *element = popElement(Slot);
				// This is to populate the queue because of the lack of GPU constructors.
				if (element == 0)
				element = &Elements[Slot];
				doneServing(Slot, ID);
				return element;
				}

	#endif			#endif // STATE_QUEUE_H

openmp/libomptarget/deviceRTLs/nvptx/src/state-queuei.h

This file was deleted.

	//===------- state-queue.cu - NVPTX OpenMP GPU State Queue ------- CUDA -*-===//
	//
	// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
	// See https://llvm.org/LICENSE.txt for license information.
	// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
	//
	//===----------------------------------------------------------------------===//
	//
	// This file contains the implementation of a queue to hand out OpenMP state
	// objects to teams of one or more kernels.
	//
	// Reference:
	// Thomas R.W. Scogland and Wu-chun Feng. 2015.
	// Design and Evaluation of Scalable Concurrent Queues for Many-Core
	// Architectures. International Conference on Performance Engineering.
	//
	//===----------------------------------------------------------------------===//

	#include "state-queue.h"

	template <typename ElementType, uint32_t SIZE>
	INLINE uint32_t omptarget_nvptx_Queue<ElementType, SIZE>::ENQUEUE_TICKET() {
	return atomicAdd((unsigned int *)&tail, 1);
	}

	template <typename ElementType, uint32_t SIZE>
	INLINE uint32_t omptarget_nvptx_Queue<ElementType, SIZE>::DEQUEUE_TICKET() {
	return atomicAdd((unsigned int *)&head, 1);
	}

	template <typename ElementType, uint32_t SIZE>
	INLINE uint32_t
	omptarget_nvptx_Queue<ElementType, SIZE>::ID(uint32_t ticket) {
	return (ticket / SIZE) * 2;
	}

	template <typename ElementType, uint32_t SIZE>
	INLINE bool omptarget_nvptx_Queue<ElementType, SIZE>::IsServing(uint32_t slot,
	uint32_t id) {
	return atomicAdd((unsigned int *)&ids[slot], 0) == id;
	}

	template <typename ElementType, uint32_t SIZE>
	INLINE void
	omptarget_nvptx_Queue<ElementType, SIZE>::PushElement(uint32_t slot,
	ElementType *element) {
	atomicExch((unsigned long long *)&elementQueue[slot],
	(unsigned long long)element);
	}

	template <typename ElementType, uint32_t SIZE>
	INLINE ElementType *
	omptarget_nvptx_Queue<ElementType, SIZE>::PopElement(uint32_t slot) {
	return (ElementType )atomicAdd((unsigned long long )&elementQueue[slot],
	(unsigned long long)0);
	}

	template <typename ElementType, uint32_t SIZE>
	INLINE void omptarget_nvptx_Queue<ElementType, SIZE>::DoneServing(uint32_t slot,
	uint32_t id) {
	atomicExch((unsigned int *)&ids[slot], (id + 1) % MAX_ID);
	}

	template <typename ElementType, uint32_t SIZE>
	INLINE void
	omptarget_nvptx_Queue<ElementType, SIZE>::Enqueue(ElementType *element) {
	uint32_t ticket = ENQUEUE_TICKET();
	uint32_t slot = ticket % SIZE;
	uint32_t id = ID(ticket) + 1;
	while (!IsServing(slot, id))
	;
	PushElement(slot, element);
	DoneServing(slot, id);
	}

	template <typename ElementType, uint32_t SIZE>
	INLINE ElementType *omptarget_nvptx_Queue<ElementType, SIZE>::Dequeue() {
	uint32_t ticket = DEQUEUE_TICKET();
	uint32_t slot = ticket % SIZE;
	uint32_t id = ID(ticket);
	while (!IsServing(slot, id))
	;
	ElementType *element = PopElement(slot);
	// This is to populate the queue because of the lack of GPU constructors.
	if (element == 0)
	element = &elements[slot];
	DoneServing(slot, id);
	return element;
	}

openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.h

This file was added.

				//===--- target_impl.h - OpenMP device RTL target code impl. ---- C++ --===//
				//
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//
				//
				// Definitions of target specific functions needed in the generic part of the
				// device RTL implementation.
				//
				//===----------------------------------------------------------------------===//

				#ifndef TARGET_IMPL_H
				JonChesterfieldUnsubmitted Not Done Reply Inline Actions This is interesting. I've just been looking at moving the functions that use inline asm into a target specific region to help code reuse for non-nvptx architectures. JonChesterfield: This is interesting. I've just been looking at moving the functions that use inline asm into a…
				#define TARGET_IMPL_H

				/// Atomically increment the pointee of \p Ptr by \p Val and return the original
				/// value of the pointee.
				template <typename T> T __kmpc_impl_atomic_add(T *Ptr, T Val) {
				ABataevUnsubmitted Done Reply Inline Actions Why do we need these functions? IMHO, they do not add anything useful, just increase code complexity. ABataev: Why do we need these functions? IMHO, they do not add anything useful, just increase code…
				jdoerfertAuthorUnsubmitted Done Reply Inline Actions They add a level of abstraction around the CUDA implementation, here the CUDA functions `atomicAdd` and `atomicExch`. In the AMD `target_impl.h` the `__kmpc_impl_atomic_exchange` would maybe be implemented with an explicit compare-exchange loop (I'm guessing here). jdoerfert: They add a level of abstraction around the CUDA implementation, here the CUDA functions…
				JonChesterfieldUnsubmitted Not Done Reply Inline Actions As it happens, atomicExch works fine for amdgcn. I'm not sure all the atomics are supported. Closely related though, atomicMax doesn't appear to be available for CUDA_ARCH < 350, so on nvptx there's a function in loop.cu (__kmpc_reduce_conditional_lastprivate) which open codes a max in terms of atomicCAS. Given some variation in capability between cuda revisions, a compatibility shim that implements at least atomicMax seems a good idea. I'm not sure writing a compatibility layer for all of the possible atomics is within scope - such a thing may already exist elsewhere. JonChesterfield: As it happens, atomicExch works fine for amdgcn. I'm not sure all the atomics are supported.
				ABataevUnsubmitted Not Done Reply Inline Actions It works for AMD but may not work for some other platforms. Better to make it target-dependent anyway. ABataev: It works for AMD but may not work for some other platforms. Better to make it target-dependent…
				grokosUnsubmitted Not Done Reply Inline Actions Closely related though, atomicMax doesn't appear to be available for CUDA_ARCH < 350, so on nvptx there's a function in loop.cu (__kmpc_reduce_conditional_lastprivate) which open codes a max in terms of atomicCAS. Just a correction, 64-bit `atomicMax` is not available for CUDA_ARCH < 350, the 32-bit version is. There has been an attempt to emulate the 64-bit `atomicMax` here: https://reviews.llvm.org/D46185 (although that patch was wrong and the author hasn't responded in well over a year). grokos: > Closely related though, atomicMax doesn't appear to be available for CUDA_ARCH < 350, so on…
				JonChesterfieldUnsubmitted Not Done Reply Inline Actions Ah, nice. Thanks for the detail. Curiously there doesn't appear to be a signed 64 bit atomicMax in cuda (according to https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html) Rocm's tree contains that (revised) patch to nvptx, presumably for whatever sm_30 device we have under test. I was under the impression that our nvptx tree was identical to upstream but actually there's some divergence for me to patch up. JonChesterfield: Ah, nice. Thanks for the detail. Curiously there doesn't appear to be a signed 64 bit atomicMax…
				return atomicAdd(Ptr, Val);
				}

				/// Atomically exchange the pointee of \p Ptr with \p Val and return the
				/// original value of the pointee.
				template <typename T> T __kmpc_impl_atomic_exchange(T *Ptr, T Val) {
				return atomicExch(Ptr, Val);
				}

				#endif // TARGET_IMPL_H

This is an archive of the discontinued LLVM Phabricator instance.

[OpenMP][NFCI] Cleanup the target state queue implementationNeeds ReviewPublic

Details

Diff Detail

Event Timeline

Revision Contents

Diff 208070

openmp/libomptarget/deviceRTLs/nvptx/src/omp_data.cu

openmp/libomptarget/deviceRTLs/nvptx/src/omptarget-nvptx.cu

openmp/libomptarget/deviceRTLs/nvptx/src/state-queue.h

openmp/libomptarget/deviceRTLs/nvptx/src/state-queuei.h

openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.h

[OpenMP][NFCI] Cleanup the target state queue implementation
Needs ReviewPublic