
[OPENMP][NVPTX] Mark parallel level counter as volatile.
Needs Review · Public

Authored by ABataev on May 24 2019, 7:47 AM.

Details

Summary

The parallel level counter is accessed simultaneously by many threads. In combination with the atomic operations used in full runtime SPMD mode (triggered by dynamic scheduling, for example), this may lead to incorrect results caused by compiler optimizations. According to the CUDA Toolkit documentation (https://docs.nvidia.com/cuda/archive/8.0/cuda-c-programming-guide/index.html#volatile-qualifier):

The compiler is free to optimize reads and writes to global or shared memory (for example, by caching global reads into registers or L1 cache) as long as it respects the memory ordering semantics of memory fence functions and memory visibility semantics of synchronization functions.

These optimizations can be disabled using the volatile keyword: If a variable located in global or shared memory is declared as volatile, the compiler assumes that its value can be changed or used at any time by another thread and therefore any reference to this variable compiles to an actual memory read or write instruction.

This is especially important in the case of thread divergence mixed with atomic operations. In our case, it may lead to undefined behavior or thread deadlock, especially on CUDA 9 and later. This change is required especially for SPMD mode with full runtime, and for SPMD mode without full runtime but with atomic operations.
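For concreteness, a minimal CUDA sketch of the failure mode the volatile qualifier prevents (the names are hypothetical, not the runtime code):

// Written by one thread, read by others. Without "volatile" the compiler may
// cache the first load of the flag in a register, so the loop below could spin
// forever; with it, every iteration compiles to an actual memory load.
__device__ volatile int flag;

__device__ void raiseFlag() { flag = 1; }

__device__ void waitForFlag() {
  while (flag == 0)
    ;
}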

Event Timeline


I still do not see why volatile is the right solution here:
In D61395 the way the level was tracked was changed but I failed to see how exactly. We have to start there.

  • What is the constant OMP_ACTIVE_PARALLEL_LEVEL doing exactly?

It marks whether the parallel region has more than 1 thread. Only the very first parallel region may have more than 1 thread. This is required to correctly implement the omp_in_parallel function, at least.

  • Probably even more fundamental, what is encoded in the parallelLevel array? It is not (only) the parallel level, correct? It is per warp, correct?

It is only the parallel level, and yes, per warp; it is required to handle L2+ parallelism. Because it is per warp, it is required to be volatile: combined with atomic operations and fully inlined runtime functions, the memory ordering is not preserved, and this leads to undefined behavior in full runtime mode, which uses atomic operations.
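For illustration, the per-warp layout described here looks roughly like this (a sketch with hypothetical names and sizes, not the actual runtime code):

#define MAX_THREADS_PER_TEAM 1024
#define WARPSIZE 32

// One level counter per warp of the team; updated by the warp master lane,
// read by the other lanes of the same warp.
__device__ volatile unsigned char parallelLevel[MAX_THREADS_PER_TEAM / WARPSIZE];

__device__ unsigned getWarpId() { return threadIdx.x / WARPSIZE; }

__device__ void incParallelLevel() { ++parallelLevel[getWarpId()]; }  // master lane only
__device__ unsigned char getParallelLevel() { return parallelLevel[getWarpId()]; }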

This is especially important in the case of thread divergence mixed with atomic operations.

We do not use atomics on the parallelLevel, do we?

Atomics are used not for parallelLevel, but for dynamic scheduling and other operations, like SM management, used in full runtime mode.

On which level is the thread divergence, in a warp?

Yes.

And another fundamental question:
What is the connection of the test to the change? It looks totally unrelated as it does not check the parallel level at all.

It demonstrates the problem in SPMD mode + full runtime, which actively uses atomic operations. Dynamic scheduling forces the use of atomics to demonstrate the problem.

It marks whether the parallel region has more than 1 thread. Only the very first parallel region may have more than 1 thread. This is required to correctly implement the omp_in_parallel function, at least.

I would have assumed omp_in_parallel to return true if the parallelLevel is not the default (0 or 1). Why isn't that so?

Because it is per warp, it is required to be volatile: combined with atomic operations and fully inlined runtime functions, the memory ordering is not preserved, and this leads to undefined behavior in full runtime mode, which uses atomic operations.

The accesses to the parallelLevel array are non-atomic; it is not surprising that they can behave "unexpectedly" when atomic accesses are present. Why don't we access parallelLevel atomically? Why don't we synchronize the warp after the parallel level is changed? If we want to communicate the updated value to all threads, it seems natural to do that explicitly and not to rely on volatile.

It marks whether the parallel region has more than 1 thread. Only the very first parallel region may have more than 1 thread. This is required to correctly implement the omp_in_parallel function, at least.

I would have assumed omp_in_parallel to return true if the parallelLevel is not the default (0 or 1). Why isn't that so?

According to the standard, it should return 1 only if we are in an active parallel region. An active parallel region is a region with a number of threads greater than 1.

Because it is per warp, it is required to be volatile: combined with atomic operations and fully inlined runtime functions, the memory ordering is not preserved, and this leads to undefined behavior in full runtime mode, which uses atomic operations.

The accesses to the parallelLevel array are non-atomic; it is not surprising that they can behave "unexpectedly" when atomic accesses are present. Why don't we access parallelLevel atomically? Why don't we synchronize the warp after the parallel level is changed? If we want to communicate the updated value to all threads, it seems natural to do that explicitly and not to rely on volatile.

According to the PTX ISA, the volatile modifier has the same semantics as the relaxed modifier, and thus the load/store operations become morally strong. Plus, volatile synchronizes memory accesses on each read/write operation for the variable.
The extra synchronization is not very good, because it slows down the execution of the whole program.
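Roughly, the PTX correspondence referred to here is the following (a sketch; the explicitly relaxed form requires sm_70 and PTX ISA 6.0+):

__device__ unsigned loadVolatile(const unsigned *p) {
  unsigned v;
  // ld.volatile is never cached and, per the PTX memory model, behaves as a
  // morally strong relaxed access.
  asm volatile("ld.volatile.global.u32 %0, [%1];" : "=r"(v) : "l"(p) : "memory");
  return v;
}

__device__ unsigned loadRelaxed(const unsigned *p) {
  unsigned v;
  // The explicitly relaxed equivalent in the PTX 6.0 memory model.
  asm volatile("ld.relaxed.gpu.global.u32 %0, [%1];" : "=r"(v) : "l"(p) : "memory");
  return v;
}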

It marks whether the parallel region has more than 1 thread. Only the very first parallel region may have more than 1 thread. This is required to correctly implement the omp_in_parallel function, at least.

I would have assumed omp_in_parallel to return true if the parallelLevel is not the default (0 or 1). Why isn't that so?

According to the standard, it should return 1 only if we are in an active parallel region. An active parallel region is a region with a number of threads greater than 1.

The standard has "levels" and "active-levels". Both need to be tracked but it seems we only have a single array?

Because it is per warp, it is required to be volatile: combined with atomic operations and fully inlined runtime functions, the memory ordering is not preserved, and this leads to undefined behavior in full runtime mode, which uses atomic operations.

The accesses to the parallelLevel array are non-atomic; it is not surprising that they can behave "unexpectedly" when atomic accesses are present. Why don't we access parallelLevel atomically? Why don't we synchronize the warp after the parallel level is changed? If we want to communicate the updated value to all threads, it seems natural to do that explicitly and not to rely on volatile.

According to the PTX ISA, the volatile modifier has the same semantics as the relaxed modifier, and thus the load/store operations become morally strong. Plus, volatile synchronizes memory accesses on each read/write operation for the variable.
The extra synchronization is not very good, because it slows down the execution of the whole program.

I don't get it. You say volatile makes each operation synchronize, but you also say synchronization is not very good. I'm also still not following what race condition you try to prevent with volatile in the first place; could you elaborate on which accesses are racing without it?

arsenm added a comment. Jun 6 2019, 4:27 PM

I thought allowing volatile for synchronization was a mistake, and CUDA deprecates doing so, at least since Volta.

I thought allowing volatile for synchronization was a mistake, and CUDA deprecates doing so, at least since Volta.

I don't see any changes in the latest PTX ISA for the semantics of volatile.

> The standard has "levels" and "active-levels". Both need to be tracked but it seems we only have a single array?
>

They are all tracked in this array.

Where is that described?

We don't need to synchronize the whole program, we just need to synchronize accesses to the parallelLevel array. In SPMD mode with full runtime (or SPMD mode with atomic ops) the PTX compiler generates unsafe code for the parallelLevel array unless it is marked volatile (i.e., a simple atomic-like construct in CUDA).

Nobody argues we should synchronize the whole program. If you want to have atomic accesses to the parallelLevel array, make the accesses to the parallelLevel array atomic.

You cannot just say that the PTX compiler generates unsafe code. That is not an argument; that is a guess.

> The standard has "levels" and "active-levels". Both need to be tracked but it seems we only have a single array?
>

They are all tracked in this array.

Where is that described?

Sorry, read the code. I'm not a debugger.

We don't need to synchronize the whole program, we just need to synchronize accesses to the parallelLevel array. In SPMD mode with full runtime (or SPMD mode with atomic ops) the PTX compiler generates unsafe code for the parallelLevel array unless it is marked volatile (i.e., a simple atomic-like construct in CUDA).

Nobody argues we should synchronize the whole program. If you want to have atomic accesses to the parallelLevel array, make the accesses to the parallelLevel array atomic.

You cannot just say that the PTX compiler generates unsafe code. That is not an argument; that is a guess.

Volatile provides the required level of atomicity per the PTX ISA.

> The standard has "levels" and "active-levels". Both need to be tracked but it seems we only have a single array?
>

They are all tracked in this array.

Where is that described?

Sorry, read the code. I'm not a debugger.

First of all, this response is simply rude.
Second, you wrote the code so my question is well justified.
Third, does that mean you encode multiple values in a single int array but did not document this at all? The code is not documentation.
Finally, this will break once we have a parallel region in a recursive loop with more than 254 instantiations, correct?

We don't need to synchronize the whole program, we just need to synchronize accesses to the parallelLevel array. In SPMD mode with full runtime (or SPMD mode with atomic ops) the PTX compiler generates unsafe code for the parallelLevel array unless it is marked volatile (i.e., a simple atomic-like construct in CUDA).

Nobody argues we should synchronize the whole program. If you want to have atomic accesses to the parallelLevel array, make the accesses to the parallelLevel array atomic.

You cannot just say that the PTX compiler generates unsafe code. That is not an argument; that is a guess.

Volatile provides the required level of atomicity per the PTX ISA.

Arguably, atomic accesses provide atomicity. You also never responded to the question of which other accesses actually race with the ones to parallelLevel, or why you do not synchronize the accesses to parallelLevel across the warp explicitly.

> The standard has "levels" and "active-levels". Both need to be tracked but it seems we only have a single array?
>

They are all tracked in this array.

Where is that described?

Sorry, read the code. I'm not a debugger.

First of all, this response is simply rude.
Second, you wrote the code so my question is well justified.
Third, does that mean you encode multiple values in a single int array but did not document this at all? The code is not documentation.
Finally, this will break once we have a parallel region in a recursive loop with more than 254 instantiations, correct?

First of all, you started this when you expressed the thought that the patches are accepted not because they are correct, but because we all work in the same company. This was rude! Besides, we don't work at the same company anymore.
Second, you need to try to understand the main idea yourself. I can't explain the whole protocol between the compiler and the runtime; it would take a lot of time. Try to understand this first and then ask questions about particular details, not about the whole scheme. Or do you want me to ask you to explain the principles of the Attributor in full? Or the loop vectorizer? Or something else? Your question is not productive.
Third, the single array is used to handle up to 128 (not 256) inner parallel regions. It is fine from the point of view of the standard. The number of supported inner parallel levels (both active and inactive) is implementation defined.
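For reference, a sketch of the encoding described above (OMP_ACTIVE_PARALLEL_LEVEL is the constant from the patch; the helpers and per-warp array reuse the hypothetical names from the earlier sketch):

#define OMP_ACTIVE_PARALLEL_LEVEL 128  // high bit of the per-warp counter

// The low bits count the nesting depth; the high bit records whether the
// single possible active region has been entered.
__device__ void incLevel(bool Active) {
  parallelLevel[getWarpId()] += 1 + (Active ? OMP_ACTIVE_PARALLEL_LEVEL : 0);
}

__device__ int inParallel() {  // the basis for omp_in_parallel
  return (parallelLevel[getWarpId()] & OMP_ACTIVE_PARALLEL_LEVEL) != 0;
}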

We don't need to synchronize the whole program, we just need to synchronize accesses to the parallelLevel array. In SPMD mode with full runtime (or SPMD mode with atomic ops) the PTX compiler generates unsafe code for the parallelLevel array unless it is marked volatile (i.e., a simple atomic-like construct in CUDA).

Nobody argues we should synchronize the whole program. If you want to have atomic accesses to the parallelLevel array, make the accesses to the parallelLevel array atomic.

You cannot just say that the PTX compiler generates unsafe code. That is not an argument; that is a guess.

Volatile provides the required level of atomicity per the PTX ISA.

Arguably, atomic accesses provide atomicity. You also never responded to the question of which other accesses actually race with the ones to parallelLevel, or why you do not synchronize the accesses to parallelLevel across the warp explicitly.

I answered all the questions. We can use explicit synchronization, yes, but it will slow down the whole program execution. Instead, it is more efficient to synchronize access to the single variable. The volatile modifier allows doing this without explicit atomic operations. In CUDA, volatile is slightly different from that in C/C++ and can be used to implement relaxed atomic operations.
In this array, the memory is written by one thread but can be read by other threads in the same warp. Without the volatile modifier or atomic ops, the ptxas tool reorders operations, and this leads to undefined behaviour when the runtime is inlined.

hfinkel added subscribers: jlebar, Hahnfeld, tra and 2 others. Edited Jun 11 2019, 8:47 AM

> The standard has "levels" and "active-levels". Both need to be tracked but it seems we only have a single array?
>

They are all tracked in this array.

Where is that described?

Sorry, read the code. I'm not a debugger.

First of all, this response is simply rude.
Second, you wrote the code so my question is well justified.
Third, does that mean you encode multiple values in a single int array but did not document this at all? The code is not documentation.
Finally, this will break once we have a parallel region in a recursive loop with more than 254 instantiations, correct?

First of all, you started this when you expressed the thought that the patches are accepted not because they are correct, but because we all work in the same company. This was rude! Besides, we don't work at the same company anymore.

I think it is important to note that, to the best of my knowledge, no one was implying any bad faith on the part of anyone. To name one specific factor: We have had problems in the past where groups of collaborators/coworkers have done off-list reviews, followed only by a perfunctory/superficial on-list review, resulting in a lack of public discussion around the relevant design and implementation considerations. The review history highlighted by Johannes can give that impression, and we all need the community to watch itself in this regard, because of many potentially-relevant factors, to ensure the quality of our public code reviews is high. I see no reason to believe that everyone here wants anything other than to create the best possible code base. Sometimes that means one member of the community pointing out that, in his or her opinion, past reviews might have been insufficiently considered, and we should welcome that, and all work together to address these kinds of concerns going forward.

Second, you need to try to understand the main idea yourself. I can't explain the whole protocol between the compiler and the runtime; it would take a lot of time. Try to understand this first and then ask questions about particular details, not about the whole scheme.

Asking for additional high-level documentation on how some subsystem functions is not, on its face, unreasonable. This happens during code review frequently, and from my perspective, should probably happen even more frequently than it already does. If you feel that the relevant documentation already exists, then it's perfectly appropriate to point this out.

Or do you want me to ask you to explain the principles of the Attributor in full? Or the loop vectorizer? Or something else? Your question is not productive.
Third, the single array is used to handle up to 128 (not 256) inner parallel regions. It is fine from the point of view of the standard. The number of supported inner parallel levels (both active and inactive) is implementation defined.

This certainly seems like something that should be documented. The implementation can certainly have limits, but we should certainly know what they are explicitly whenever possible.

We don't need to synchronize the whole program, we just need to synchronize accesses to the parallelLevel array. In SPMD mode with full runtime (or SPMD mode with atomic ops) the PTX compiler generates unsafe code for the parallelLevel array unless it is marked volatile (i.e., a simple atomic-like construct in CUDA).

Nobody argues we should synchronize the whole program. If you want to have atomic accesses to the parallelLevel array, make the accesses to the parallelLevel array atomic.

You cannot just say that the PTX compiler generates unsafe code. That is not an argument; that is a guess.

Volatile provides the required level of atomicity per the PTX ISA.

Arguably, atomic accesses provide atomicity. You also never responded to the question of which other accesses actually race with the ones to parallelLevel, or why you do not synchronize the accesses to parallelLevel across the warp explicitly.

I answered all the questions. We can use explicit synchronization, yes, but it will slow down the whole program execution. Instead, it is more efficient to synchronize access to the single variable. The volatile modifier allows doing this without explicit atomic operations. In CUDA, volatile is slightly different from that in C/C++ and can be used to implement relaxed atomic operations.
In this array, the memory is written by one thread but can be read by other threads in the same warp. Without the volatile modifier or atomic ops, the ptxas tool reorders operations, and this leads to undefined behaviour when the runtime is inlined.

This is a very interesting situation. First, I can certainly imagine that older versions of LLVM might have implicitly treated volatile in a way consistent with these semantics, but I have no idea whether Clang's CUDA implementation now supports these semantics explicitly. I'm not saying that we shouldn't use this feature of CUDA as a result, but this is important to understand. I certainly did not realize that CUDA provided special volatile semantics that include atomic properties, and if that's something that LLVM needs to support, we certainly need to have this explicitly documented.

@reames , @tra , @Hahnfeld , @jlebar , @chandlerc, I see that this was discussed in D50391 (in terms of using PTX's volatile to provide atomic lowering), but it's not clear to me that using volatile with atomic semantics in LLVM is something we guarantee will work (even in CUDA mode). I imagine that we'd need to lower these in terms of relaxed atomics in Clang for this to really be guaranteed to work correctly.

Regardless, please don't take offence to skepticism around using volatile to accomplish synchronization. I doubt that most people know about CUDA's particular properties in this regard, and as you know, in C/C++ the properties are much weaker. This is definitely something that requires some comments so that readers of the code understand what's going on.
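For reference, the explicit alternative under discussion would look roughly like this (a sketch using the GNU atomic builtins, assuming the device toolchain lowers them for the target):

// Relaxed atomic accesses make the intended semantics explicit to the
// compiler instead of relying on CUDA's special treatment of volatile.
__device__ unsigned char loadLevel(unsigned char *p) {
  return __atomic_load_n(p, __ATOMIC_RELAXED);
}

__device__ void storeLevel(unsigned char *p, unsigned char v) {
  __atomic_store_n(p, v, __ATOMIC_RELAXED);
}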

> The standard has "levels" and "active-levels". Both need to be tracked but it seems we only have a single array?
>

They are all tracked in this array.

Where is that described?

Sorry, read the code. I'm not a debugger.

First of all, this response is simply rude.
Second, you wrote the code so my question is well justified.
Third, does that mean you encode multiple values in a single int array but did not document this at all? The code is not documentation.
Finally, this will break once we have a parallel region in a recursive loop with more than 254 instantiations, correct?

First of all, you started this when you expressed the thought that the patches are accepted not because they are correct, but because we all work in the same company. This was rude! Besides, we don't work at the same company anymore.

I think it is important to note that, to the best of my knowledge, no one was implying any bad faith on the part of anyone. To name one specific factor: We have had problems in the past where groups of collaborators/coworkers have done off-list reviews, followed only by a perfunctory/superficial on-list review, resulting in a lack of public discussion around the relevant design and implementation considerations. The review history highlighted by Johannes can give that impression, and we all need the community to watch itself in this regard, because of many potentially-relevant factors, to ensure the quality of our public code reviews is high. I see no reason to believe that everyone here wants anything other than to create the best possible code base. Sometimes that means one member of the community pointing out that, in his or her opinion, past reviews might have been insufficiently considered, and we should welcome that, and all work together to address these kinds of concerns going forward.

Hal, I agree that it was a very dangerous situation, but I think you need to prove something before trying to escalate the situation and blame somebody for doing incorrect things. Johannes did this without any proof; he directly blamed me and others for doing improper things, though he has no idea at all how things work in CUDA.

Second, you need to try to understand the main idea yourself. I can't explain the whole protocol between the compiler and the runtime; it would take a lot of time. Try to understand this first and then ask questions about particular details, not about the whole scheme.

Asking for additional high-level documentation on how some subsystem functions is not, on its face, unreasonable. This happens during code review frequently, and from my perspective, should probably happen even more frequently than it already does. If you feel that the relevant documentation already exists, then it's perfectly appropriate to point this out.

Yes, but you need to have at least some basic knowledge about OpenMP itself, CUDA, and the OpenMP runtime. I have no time to write lectures trying to explain basic things to Johannes.

Or do you want me to ask you to explain the principles of the Attributor in full? Or the loop vectorizer? Or something else? Your question is not productive.
Third, the single array is used to handle up to 128 (not 256) inner parallel regions. It is fine from the point of view of the standard. The number of supported inner parallel levels (both active and inactive) is implementation defined.

This certainly seems like something that should be documented. The implementation can certainly have limits, but we should certainly know what they are explicitly whenever possible.

Yes, you're right. The design decisions must be described somewhere. But I don't think this must prevent the work on improving the runtime.

We don't need to synchronize the whole program, we just need to synchronize accesses to the parallelLevel array. In SPMD mode with full runtime (or SPMD mode with atomic ops) the PTX compiler generates unsafe code for the parallelLevel array unless it is marked volatile (i.e., a simple atomic-like construct in CUDA).

Nobody argues we should synchronize the whole program. If you want to have atomic accesses to the parallelLevel array, make the accesses to the parallelLevel array atomic.

You cannot just say that the PTX compiler generates unsafe code. That is not an argument; that is a guess.

Volatile provides the required level of atomicity per the PTX ISA.

Arguably, atomic accesses provide atomicity. You also never responded to the question of which other accesses actually race with the ones to parallelLevel, or why you do not synchronize the accesses to parallelLevel across the warp explicitly.

I answered all the questions. We can use explicit synchronization, yes, but it will slow down the whole program execution. Instead, it is more efficient to synchronize access to the single variable. The volatile modifier allows doing this without explicit atomic operations. In CUDA, volatile is slightly different from that in C/C++ and can be used to implement relaxed atomic operations.
In this array, the memory is written by one thread but can be read by other threads in the same warp. Without the volatile modifier or atomic ops, the ptxas tool reorders operations, and this leads to undefined behaviour when the runtime is inlined.

This is a very interesting situation. First, I can certainly imagine that older versions of LLVM might have implicitly treated volatile in a way consistent with these semantics, but I have no idea whether Clang's CUDA implementation now supports these semantics explicitly. I'm not saying that we shouldn't use this feature of CUDA as a result, but this is important to understand. I certainly did not realize that CUDA provided special volatile semantics that include atomic properties, and if that's something that LLVM needs to support, we certainly need to have this explicitly documented.

@reames , @tra , @Hahnfeld , @jlebar , @chandlerc, I see that this was discussed in D50391 (in terms of using PTX's volatile to provide atomic lowering), but it's not clear to me that using volatile with atomic semantics in LLVM is something we guarantee will work (even in CUDA mode). I imagine that we'd need to lower these in terms of relaxed atomics in Clang for this to really be guaranteed to work correctly.

Regardless, please don't take offence to skepticism around using volatile to accomplish synchronization. I doubt that most people know about CUDA's particular properties in this regard, and as you know, in C/C++ the properties are much weaker. This is definitely something that requires some comments so that readers of the code understand what's going on.

If somebody tries to review the code, the review must be productive. If the review leads to a set of lectures about language details, it is not productive; those are school lectures. Should I explain the CUDA language in the review? No. To review the code you need to have at least some minimal level of expertise.

jfb added a comment. Jun 11 2019, 9:50 AM

I think it is important to note that, to the best of my knowledge, no one was implying any bad faith on the part of anyone. To name one specific factor: We have had problems in the past where groups of collaborators/coworkers have done off-list reviews, followed only by a perfunctory/superficial on-list review, resulting in a lack of public discussion around the relevant design and implementation considerations. The review history highlighted by Johannes can give that impression, and we all need the community to watch itself in this regard, because of many potentially-relevant factors, to ensure the quality of our public code reviews is high. I see no reason to believe that everyone here wants anything other than to create the best possible code base. Sometimes that means one member of the community pointing out that, in his or her opinion, past reviews might have been insufficiently considered, and we should welcome that, and all work together to address these kinds of concerns going forward.

Hal, I agree that it was a very dangerous situation, but I think you need to prove something before trying to escalate the situation and blame somebody for doing incorrect things. Johannes did this without any proof; he directly blamed me and others for doing improper things, though he has no idea at all how things work in CUDA.

Alexey: Hal is trying to turn this review into a productive one. Please internalize his guidance, and come back ready to answer feedback which you'd rather not answer. I don't see this review going anywhere without you changing your approach. You're frustrated by having to explain what you consider to be basics, and that's fine. Say so politely, and point at relevant resources. Reviews don't *need* to be a school lecture, but an open-source project is healthy when everyone learns. First it forces the committer to actually understand what they're doing (explaining basics is a great learning tool), and second it makes the code easier to maintain if the original committer stops participating.

In D62393#1538350, @jfb wrote:

I think it is important to note that, to the best of my knowledge, no one was implying any bad faith on the part of anyone. To name one specific factor: We have had problems in the past where groups of collaborators/coworkers have done off-list reviews, followed only by a perfunctory/superficial on-list review, resulting in a lack of public discussion around the relevant design and implementation considerations. The review history highlighted by Johannes can give that impression, and we all need the community to watch itself in this regard, because of many potentially-relevant factors, to ensure the quality of our public code reviews is high. I see no reason to believe that everyone here wants anything other than to create the best possible code base. Sometimes that means one member of the community pointing out that, in his or her opinion, past reviews might have been insufficiently considered, and we should welcome that, and all work together to address these kinds of concerns going forward.

Hal, I agree that it was a very dangerous situation, but I think you need to prove something before trying to escalate the situation and blame somebody for doing incorrect things. Johannes did this without any proof; he directly blamed me and others for doing improper things, though he has no idea at all how things work in CUDA.

Alexey: Hal is trying to turn this review into a productive one. Please internalize his guidance, and come back ready to answer feedback which you'd rather not answer. I don't see this review going anywhere without you changing your approach. You're frustrated by having to explain what you consider to be basics, and that's fine. Say so politely, and point at relevant resources. Reviews don't *need* to be a school lecture, but an open-source project is healthy when everyone learns. First it forces the committer to actually understand what they're doing (explaining basics is a great learning tool), and second it makes the code easier to maintain if the original committer stops participating.

I have already explained everything several times. I said explicitly that in CUDA volatile is a little bit different than in C/C++. I explained that volatile operations have the same semantics as relaxed atomics per the PTX ISA. What else could I say? To explain how the parallelLevel counter works I would need to explain how a warp works. The reviewer needs to know this himself. After he gets this basic knowledge, the main idea will be clear. If not, I can help him understand it. The code of the runtime is not rocket science; actually, it is very simple. But you need to spend some time on it yourself.

I know some things about CUDA, volatile and C++. Let's see if I can understand the part of the proposed change that involves these. I don't understand the part about the test but I don't need to, I'll ignore that.

The gist of the issue is that the parallelLevel table should really be atomic<> because of data races on it (that was a bug prior to this; maybe there are more such bugs lying around), except there's a problem with the obvious fix: atomic<> is not available in CUDA (yet), so it's not an option for fixing this issue. The next best thing we have instead is volatile.

Now, volatile in PTX (e.g. asm("ld.volatile...[x];")) and volatile (e.g. "volatile ...x;") in C++ source like this are not exactly the same thing. When CUDA says that volatile is equivalent to memory_order_relaxed, it's saying something clear (I think) about the PTX-level code, but it's still being pretty vague about CUDA C++ level code. OTOH it's entirely possible for Clang to do something with either atomic<> or volatile that isn't valid for the other -- and that's a different can of worms than, say, NVCC which does {whatever NVCC does, it's not the compiler you're using}.

However, since people do use CUDA C++ volatile this way a lot already, you can't really have a good CUDA toolchain unless it is the case (including, by accident) that it works for this purpose. In other words, it's probably reasonable to assume this in your code because every other code would be on fire otherwise, and it's not on fire, so far as we can tell.

It might be worth it to prepare your code for the eventual arrival of atomic<> on CUDA. That is, maybe create a template alias on T with some helper functions, just enough for your use. It might make this code more self-documenting and make it easy to make it 100% legit later on.
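A sketch of such a helper (hypothetical names): today the storage stays volatile with CUDA's relaxed-like semantics, and once atomic<> arrives only the member and accessors need to change.

template <typename T> struct LevelCell {
  volatile T Value;  // later: atomic<T> with memory_order_relaxed accesses
  __device__ T load() const { return Value; }
  __device__ void store(T V) { Value = V; }
  __device__ void add(T V) { Value = Value + V; }  // single-writer use only
};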

I know some things about CUDA, volatile and C++. Let's see if I can understand the part of the proposed change that involves these. I don't understand the part about the test but I don't need to, I'll ignore that.

The gist of the issue is that the parallelLevel table should really be atomic<> because of data races on it (that was a bug prior to this; maybe there are more such bugs lying around), except there's a problem with the obvious fix: atomic<> is not available in CUDA (yet), so it's not an option for fixing this issue. The next best thing we have instead is volatile.

Now, volatile in PTX (e.g. asm("ld.volatile...[x];")) and volatile (e.g. "volatile ...x;") in C++ source like this are not exactly the same thing. When CUDA says that volatile is equivalent to memory_order_relaxed, it's saying something clear (I think) about the PTX-level code, but it's still being pretty vague about CUDA C++ level code. OTOH it's entirely possible for Clang to do something with either atomic<> or volatile that isn't valid for the other -- and that's a different can of worms than, say, NVCC which does {whatever NVCC does, it's not the compiler you're using}.

However, since people do use CUDA C++ volatile this way a lot already, you can't really have a good CUDA toolchain unless it is the case (including, by accident) that it works for this purpose. In other words, it's probably reasonable to assume this in your code because every other code would be on fire otherwise, and it's not on fire, so far as we can tell.

It might be worth it to prepare your code for the eventual arrival of atomic<> on CUDA. That is, maybe create a template alias on T with some helper functions, just enough for your use. It might make this code more self-documenting and make it easy to make it 100% legit later on.

Hi Olivier, thanks for your comments. You're absolutely right. Actually, we're using both compilers, nvcc and clang (under different conditions, though). Marking the variable volatile does not break it at the LLVM level. Maybe it is by accident, but I rather doubt it.
Do you suggest creating a template function that will provide access to the parallelLevel variable? And when atomic<> is supported by CUDA, change the type of this variable to atomic<> so the compiler can automatically instantiate this template function with the proper type, right? Or do you have something different in mind? If so, could you provide a small example of your idea?

I know some things about CUDA, volatile and C++. Let's see if I can understand the part of the proposed change that involves these. I don't understand the part about the test but I don't need to, I'll ignore that.

The gist of the issue is that the parallelLevel table should really be atomic<> because of data races on it (that was a bug prior to this; maybe there are more such bugs lying around), except there's a problem with the obvious fix: atomic<> is not available in CUDA (yet), so it's not an option for fixing this issue. The next best thing we have instead is volatile.

Now, volatile in PTX (e.g. asm("ld.volatile...[x];")) and volatile (e.g. "volatile ...x;") in C++ source like this are not exactly the same thing. When CUDA says that volatile is equivalent to memory_order_relaxed, it's saying something clear (I think) about the PTX-level code, but it's still being pretty vague about CUDA C++ level code. OTOH it's entirely possible for Clang to do something with either atomic<> or volatile that isn't valid for the other -- and that's a different can of worms than, say, NVCC which does {whatever NVCC does, it's not the compiler you're using}.

However, since people do use CUDA C++ volatile this way a lot already, you can't really have a good CUDA toolchain unless it is the case (including, by accident) that it works for this purpose. In other words, it's probably reasonable to assume this in your code because every other code would be on fire otherwise, and it's not on fire, so far as we can tell.

It might be worth it to prepare your code for the eventual arrival of atomic<> on CUDA. That is, maybe create a template alias on T with some helper functions, just enough for your use. It might make this code more self-documenting and make it easy to make it 100% legit later on.

Hi Olivier, thanks for your comments. You're absolutely right. Actually, we're using both compilers, nvcc and clang (under different conditions, though). Marking the variable volatile does not break it at the LLVM level. Maybe it is by accident, but I rather doubt it.
Do you suggest creating a template function that will provide access to the parallelLevel variable? And when atomic<> is supported by CUDA, change the type of this variable to atomic<> so the compiler can automatically instantiate this template function with the proper type, right? Or do you have something different in mind? If so, could you provide a small example of your idea?

That's better than what I had in mind, so it sounds like a good step to me. (I don't know what the norm is here, but maybe with a comment and an issue to fix it later.) Cheers.

I know some things about CUDA, volatile and C++. Let's see if I can understand the part of the proposed change that involves these. I don't understand the part about the test but I don't need to, I'll ignore that.

The gist of the issue is that the parallelLevel table should really be atomic<> because of data races on it (that was a bug prior to this; maybe there are more such bugs lying around), except there's a problem with the obvious fix: atomic<> is not available in CUDA (yet), so it's not an option for fixing this issue. The next best thing we have instead is volatile.

Now, volatile in PTX (e.g. asm("ld.volatile...[x];")) and volatile (e.g. "volatile ...x;") in C++ source like this are not exactly the same thing. When CUDA says that volatile is equivalent to memory_order_relaxed, it's saying something clear (I think) about the PTX-level code, but it's still being pretty vague about CUDA C++ level code. OTOH it's entirely possible for Clang to do something with either atomic<> or volatile that isn't valid for the other -- and that's a different can of worms than, say, NVCC which does {whatever NVCC does, it's not the compiler you're using}.

However, since people do use CUDA C++ volatile this way a lot already, you can't really have a good CUDA toolchain unless it is the case (including, by accident) that it works for this purpose. In other words, it's probably reasonable to assume this in your code because every other code would be on fire otherwise, and it's not on fire, so far as we can tell.

It might be worth it to prepare your code for the eventual arrival of atomic<> on CUDA. That is, maybe create a template alias on T with some helper functions, just enough for your use. It might make this code more self-documenting and make it easy to make it 100% legit later on.

Hi Olivier, thanks for your comments. You're absolutely right. Actually, we're using both compilers, nvcc and clang (under different conditions, though). Marking the variable volatile does not break it at the LLVM level. Maybe it is by accident, but I rather doubt it.
Do you suggest creating a template function that will provide access to the parallelLevel variable? And when atomic<> is supported by CUDA, change the type of this variable to atomic<> so the compiler can automatically instantiate this template function with the proper type, right? Or do you have something different in mind? If so, could you provide a small example of your idea?

That's better than what I had in mind, so it sounds like a good step to me. (I don't know what the norm is here, but maybe with a comment and an issue to fix it later.) Cheers.

Ok, I can do this, thanks for the idea.

I know some things about CUDA, volatile and C++. Let's see if I can understand the part of the proposed change that involves these. I don't understand the part about the test but I don't need to, I'll ignore that.

The gist of the issue is that the parallelLevel table should really be atomic<> because of data races on it (that was a bug prior to this; maybe there are more such bugs lying around), except there's a problem with the obvious fix: atomic<> is not available in CUDA (yet), so it's not an option for fixing this issue. The next best thing we have instead is volatile.

Now, volatile in PTX (e.g. asm("ld.volatile...[x];")) and volatile (e.g. "volatile ...x;") in C++ source like this are not exactly the same thing. When CUDA says that volatile is equivalent to memory_order_relaxed, it's saying something clear (I think) about the PTX-level code, but it's still being pretty vague about CUDA C++ level code. OTOH it's entirely possible for Clang to do something with either atomic<> or volatile that isn't valid for the other -- and that's a different can of worms than, say, NVCC which does {whatever NVCC does, it's not the compiler you're using}.

However, since people do use CUDA C++ volatile this way a lot already, you can't really have a good CUDA toolchain unless it is the case (including, by accident) that it works for this purpose. In other words, it's probably reasonable to assume this in your code because every other code would be on fire otherwise, and it's not on fire, so far as we can tell.

It might be worth it to prepare your code for the eventual arrival of atomic<> on CUDA. That is, maybe create a template alias on T with some helper functions, just enough for your use. It might make this code more self-documenting and make it easy to make it 100% legit later on.

Hi Olivier, thanks for your comments.

+1

You're absolutely right. Actually, we're using both compilers, nvcc and clang (under different conditions, though). Marking the variable volatile does not break it at the LLVM level. Maybe it is by accident, but I rather doubt it.

I'm pretty sure that it's not accidental, but I'm very concerned about relying on it working without documenting the required semantics. Two things:

  1. First, we currently provide volatile accesses with some synchronization semantics. For example, for LoadInst, we have this:
bool isUnordered() const {
  return (getOrdering() == AtomicOrdering::NotAtomic ||
          getOrdering() == AtomicOrdering::Unordered) &&
         !isVolatile();
}

while, at the same time, documenting that volatile has no such semantics at the IR level. The LangRef currently says, " The optimizers may change the order of volatile operations relative to non-volatile operations. This is not Java’s “volatile” and has no cross-thread synchronization behavior."

And, thus, my concern. It works, and there is some explicit code to support this functionality, but the semantics are not documented (and, perhaps, are documented *not* to work). This, I think, we should correct to ensure correct functioning going forward. Alternatively, we might change how Clang lowers volatile accesses in CUDA mode to make them relaxed atomics? Any thoughts on this?

Do you suggest creating a template function that will provide access to the parallelLevel variable? And when atomic<> is supported by CUDA, change the type of this variable to atomic<> so the compiler can automatically instantiate this template function with the proper type, right? Or do you have something different in mind? If so, could you provide a small example of your idea?

I know some things about CUDA, volatile and C++. Let's see if I can understand the part of the proposed change that involves these. I don't understand the part about the test but I don't need to, I'll ignore that.

The gist of the issue is that the parallelLevel table should really be atomic<> because of data races on it (that was a bug prior to this; maybe there are more such bugs lying around), except there's a problem with the obvious fix: atomic<> is not available in CUDA (yet), so it's not an option for fixing this issue. The next best thing we have instead is volatile.

Now, volatile in PTX (e.g. asm("ld.volatile...[x];")) and volatile (e.g. "volatile ...x;") in C++ source like this are not exactly the same thing. When CUDA says that volatile is equivalent to memory_order_relaxed, it's saying something clear (I think) about the PTX-level code, but it's still being pretty vague about CUDA C++ level code. OTOH it's entirely possible for Clang to do something with either atomic<> or volatile that isn't valid for the other -- and that's a different can of worms than, say, NVCC which does {whatever NVCC does, it's not the compiler you're using}.

However, since people do use CUDA C++ volatile this way a lot already, you can't really have a good CUDA toolchain unless it is the case (including, by accident) that it works for this purpose. In other words, it's probably reasonable to assume this in your code because every other code would be on fire otherwise, and it's not on fire, so far as we can tell.

It might be worth it to prepare your code for the eventual arrival of atomic<> on CUDA. That is, maybe create a template alias on T with some helper functions, just enough for your use. It might make this code more self-documenting and make it easy to make it 100% legit later on.

Hi Olivier, thanks for your comments.

+1

You're absolutely right. Actually, we're using both compilers, nvcc and clang (under different conditions, though). Marking the variable volatile does not break it at the LLVM level. Maybe it is by accident, but I rather doubt it.

I'm pretty sure that it's not accidental, but I'm very concerned about relying on it working without documenting the required semantics. Two things:

  1. First, we currently provide volatile accesses with some synchronization semantics. For example, for LoadInst, we have this:
bool isUnordered() const {
  return (getOrdering() == AtomicOrdering::NotAtomic ||
          getOrdering() == AtomicOrdering::Unordered) &&
         !isVolatile();
}

while, at the same time, documenting that volatile has no such semantics at the IR level. The LangRef currently says, " The optimizers may change the order of volatile operations relative to non-volatile operations. This is not Java’s “volatile” and has no cross-thread synchronization behavior."

And, thus, my concern. It works, and there is some explicit code to support this functionality, but the semantics are not documented (and, perhaps, are documented *not* to work). This, I think, we should correct to ensure correct functioning going forward. Alternatively, we might change how Clang lowers volatile accesses in CUDA mode to make them relaxed atomics? Any thoughts on this?

Yes, probably it is a good idea to fix the possible incompatibility with nvcc. This problem may be not very significant for CUDA if some possibly dangerous optimization passes are disabled for the CUDA toolchain, but anyway, it is better to translate it directly to relaxed atomic operations just in case.

Do you suggest creating a template function that will provide access to the parallelLevel variable? And when atomic<> is supported by CUDA, change the type of this variable to atomic<> so the compiler can automatically instantiate this template function with the proper type, right? Or do you have something different in mind? If so, could you provide a small example of your idea?

jfb added a comment. Jun 11 2019, 3:29 PM

FWIW we already support -fms-volatile, so there's precedent if you wanted -fnv-volatile (however terrible that is).

In D62393#1539012, @jfb wrote:

FWIW we already support -fms-volatile, so there's precedent if you wanted -fnv-volatile (however terrible that is).

Most probably, we don't need this option; clang should emit correct code for volatile vars in CUDA mode automatically.

In D62393#1539012, @jfb wrote:

FWIW we already support -fms-volatile, so there's precedent if you wanted -fnv-volatile (however terrible that is).

Most probably, we don't need this option; clang should emit correct code for volatile vars in CUDA mode automatically.

+1

That having been said, maybe this is a very simple change because it is the same as (or very similar to) what MSVolatile does?

/// An LValue is a candidate for having its loads and stores be made atomic if
/// we are operating under /volatile:ms *and* the LValue itself is volatile and
/// performing such an operation can be performed without a libcall.
bool CodeGenFunction::LValueIsSuitableForInlineAtomic(LValue LV) {
  if (!CGM.getCodeGenOpts().MSVolatile) return false;
  AtomicInfo AI(*this, LV);
  bool IsVolatile = LV.isVolatile() || hasVolatileMember(LV.getType());
  // An atomic is inline if we don't need to use a libcall.
  bool AtomicIsInline = !AI.shouldUseLibcall();
  // MSVC doesn't seem to do this for types wider than a pointer.
  if (getContext().getTypeSize(LV.getType()) >
      getContext().getTypeSize(getContext().getIntPtrType()))
    return false;
  return IsVolatile && AtomicIsInline;
}

Making this return true when LangOpts.CUDAIsDevice is set might essentially do what we need?
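That is, something like the following (an untested sketch; the lowering sites would additionally have to pick relaxed rather than seq_cst ordering for the CUDA case):

bool CodeGenFunction::LValueIsSuitableForInlineAtomic(LValue LV) {
  // Sketch: fire for /volatile:ms as before, and also for device-side CUDA.
  if (!CGM.getCodeGenOpts().MSVolatile && !CGM.getLangOpts().CUDAIsDevice)
    return false;
  AtomicInfo AI(*this, LV);
  bool IsVolatile = LV.isVolatile() || hasVolatileMember(LV.getType());
  // An atomic is inline if we don't need to use a libcall.
  bool AtomicIsInline = !AI.shouldUseLibcall();
  if (getContext().getTypeSize(LV.getType()) >
      getContext().getTypeSize(getContext().getIntPtrType()))
    return false;
  return IsVolatile && AtomicIsInline;
}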

In D62393#1539012, @jfb wrote:

FWIW we already support -fms-volatile, so there's precedent if you wanted -fnv-volatile (however terrible that is).

Most probably, we don't need this option; clang should emit correct code for volatile vars in CUDA mode automatically.

+1

That having been said, maybe this is a very simple change because it is the same as (or very similar to) what MSVolatile does?

/// An LValue is a candidate for having its loads and stores be made atomic if
/// we are operating under /volatile:ms *and* the LValue itself is volatile and
/// performing such an operation can be performed without a libcall.
bool CodeGenFunction::LValueIsSuitableForInlineAtomic(LValue LV) {
  if (!CGM.getCodeGenOpts().MSVolatile) return false;
  AtomicInfo AI(*this, LV);
  bool IsVolatile = LV.isVolatile() || hasVolatileMember(LV.getType());
  // An atomic is inline if we don't need to use a libcall.
  bool AtomicIsInline = !AI.shouldUseLibcall();
  // MSVC doesn't seem to do this for types wider than a pointer.
  if (getContext().getTypeSize(LV.getType()) >
      getContext().getTypeSize(getContext().getIntPtrType()))
    return false;
  return IsVolatile && AtomicIsInline;
}

Making this return true when LangOpts.CUDAIsDevice is set might essentially do what we need?

Better to ask @tra or @jlebar. Maybe clang already does something similar for CUDA?

jfb added a comment. Jun 11 2019, 4:03 PM
In D62393#1539012, @jfb wrote:

FWIW we already support -fms-volatile, so there's precedent if you wanted -fnv-volatile (however terrible that is).

Most probably, we don't need this option; clang should emit correct code for volatile vars in CUDA mode automatically.

+1

That having been said, maybe this is a very simple change because it is the same as (or very similar to) what MSVolatile does?

My point exactly: MS volatile promotes volatile to seq_cst; it's trivial to copy that and promote to relaxed instead, and it isn't too objectionable because we've got precedent.

Further consideration: you say you want to match the nvcc semantics... but it sounds like it doesn't actually have great / documented semantics, and they might change. Should this volatile->relaxed be opt-in or opt-out? Is that the only thing we'd want to guarantee?

In D62393#1539057, @jfb wrote:
In D62393#1539012, @jfb wrote:

FWIW we already support -fms-volatile, so there's precedent if you wanted -fnv-volatile (however terrible that is).

Most probably, we don't need this option; clang should emit correct code for volatile vars in CUDA mode automatically.

+1

That having been said, maybe this is a very simple change because it is the same as (or very similar to) what MSVolatile does?

My point exactly: MS volatile promotes volatile to seq_cst; it's trivial to copy that and promote to relaxed instead, and it isn't too objectionable because we've got precedent.

Further consideration: you say you want to match the nvcc semantics... but it sounds like it doesn't actually have great / documented semantics, and they might change. Should this volatile->relaxed be opt-in or opt-out? Is that the only thing we'd want to guarantee?

I think for now we can support this unconditionally. Later, if the semantics change, we can check the CUDA version and disable these semantics for newer versions. It seems to me volatile->relaxed should be enough.

This is a clarification of some older comments.

First of all, you started this when you expressed the thought that the patches are accepted not because they are correct, but because we all work in the same company. This was rude! Besides, we don't work at the same company anymore.

I do not remember saying this, especially since I know we don't work at the same company.

Second, you need to try to understand the main idea yourself. I can't explain the whole protocol between the compiler and the runtime; it would take a lot of time. Try to understand this first and then ask questions about particular details, not about the whole scheme. Or do you want me to ask you to explain the principles of the Attributor in full? Or the loop vectorizer? Or something else? Your question is not productive.

That is exactly what happened (wrt. the Attributor): people looked at the code and asked about principles, requested documentation, etc. If you look at the code now, it is all the better for it. So far you have just ignored my requests for clarifications and justifications, which is what this whole code review actually started with.

Third, the single array is used to handle up to 128 (not 256) inner parallel regions. It is fine from the point of view of the standard. The number of supported inner parallel levels (both active and inactive) is implementation defined.

I asked where we document that a single array encodes both the number of active and inactive parallel regions at the same time. The code is not sufficient documentation for such a low-level implementation detail. Similar, but previously not on my radar, is the fact that there is an apparently undocumented implementation detail wrt the number of levels we can handle.

Hal, I agree that it was a very dangerous situation, but I think you need to prove something before trying to escalate the situation and blame somebody for doing incorrect things. Johannes did this without any proof; he directly blamed me and others for doing improper things, though he has no idea at all how things work in CUDA.

I did already provide plenty of arguments in https://reviews.llvm.org/D62199#1515182 and https://reviews.llvm.org/D62199#1517073.

For the record, this code review was (from my side) never about accusations or improper behavior but about:

  1. The right solution to a problem I don't think we specified(*) properly, and
  2. The general lack of documentation that led to various questions on how to interpret the changes.

(*) My question about which accesses are racy has been ignored, as have my inquiries about an alternative explicit synchronization to communicate the value through all threads in the warp.

I don't think I have "no idea at all how things work in CUDA", and I don't appreciate the accusation.

Yes, but you need to have at least some basic knowledge about OpenMP itself, CUDA, and the OpenMP runtime. I have no time to write lectures trying to explain basic things to Johannes.

Me, and probably other people, might argue I have some OpenMP/Clang experience.

Yes, you're right. The design decisions must be described somewhere. But I don't think this must prevent the work on improving the runtime.

This seems to me like a core issue. Improvements without documentation will inevitably cause problems down the road, and rushing into a solution is also not a sustainable practice.

Arguably, atomic accesses provide atomicity. You also never responded to the question which other accesses actually race with the ones to parallelLevel, or why you do not synchronize the accesses to parallelLevel across the warp explicitly.

I answered all the questions. We can use explicit synchronization, yes, but it will slow down the whole program execution. Instead, it is more effective to synchronize the access to the single variable. The volatile modifier allows doing this without explicit atomic operations. In Cuda, volatile is slightly different from that in C/C++ and can be used to implement relaxed atomic operations.
In this array, the memory is written by one thread but can be read by other threads in the same warp. Without the volatile modifier or atomic ops, the ptxas tool reorders operations, and that leads to undefined behaviour when the runtime is inlined.
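As an illustration of the pattern being described, a CUDA sketch only (the array name, sizes, and function are assumptions, not the actual runtime source):

#define WARPS_PER_BLOCK 32  // assumption for this sketch

__shared__ volatile unsigned char parallelLevel[WARPS_PER_BLOCK];

__device__ void enterParallel(int WarpId, int LaneId) {
  if (LaneId == 0)
    ++parallelLevel[WarpId];  // written by a single lane of the warp
  // Divergent execution may happen here. Without volatile (or atomics),
  // ptxas may keep the counter in a register, so the other lanes would
  // observe a stale value below.
  unsigned char Level = parallelLevel[WarpId];  // read by every lane
  (void)Level;
}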

According to D50391, explicit relaxed accesses are lowered into volatile PTX accesses. Why did you see a slowdown or did you simply expect a slowdown?

If somebody tries to review the code, this review must be productive. If the review leads to a set of lectures about language details, it is not productive. These are school lectures. Should I explain the Cuda language in the review? No. To review the code you need to have at least some minimal level of expertise.

I never asked for a school lecture, nor is it appropriate that you think you have to give me one instead of actually answering my questions and commenting on my concerns. Simply questioning my expertise and thereby ignoring my concerns is not helping anyone.

Hal, @hfinkel, do you really want me to continue with all this stuff?

Trying to catch up on all the comments, replies to some:

I would have assumed omp_in_parallel to return true if the parallelLevel is not the default (0 or 1). Why isn't that so?

According to the standard, it should return 1 only if we are in an active parallel region. An active parallel region is a region with a number of threads greater than 1.

I don't agree with this statement and its implication that we need to track both levels and active-levels as I already argued in D61380: In my understanding of the standard, omp_in_parallel should return true iff it's nested somewhere in an active parallel region.

Maybe I did not express myself clearly, but this is exactly how it works currently. I agreed with you and implemented it exactly as you said: 1 is returned if at least 1 parallel region is active.
And we count both active and inactive regions. We may have only 1 active region, the very first one. If the first region is active, the MSB is set to 1 to mark this situation.
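A sketch of that encoding (the helper names are invented for illustration): the low 7 bits count nested regions, active plus inactive, which gives the 128-level limit mentioned earlier, and the MSB records that the single possible active region has more than one thread:

#define ACTIVE_LEVEL_BIT 0x80u  // hypothetical name for the MSB mask

static inline unsigned nestingDepth(unsigned char V) {
  return V & 0x7fu;  // low 7 bits: up to 128 nested regions
}

static inline bool inActiveParallel(unsigned char V) {
  return (V & ACTIVE_LEVEL_BIT) != 0;  // MSB: active region with >1 thread
}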

@reames , @tra , @Hahnfeld , @jlebar , @chandlerc, I see that this was discussed in D50391 (in terms of using PTX's volatile to provide atomic lowering), but it's not clear to me that using volatile with atomic semantics in LLVM is something we guarantee will work (even in CUDA mode). I imagine that we'd need to lower these in terms of relaxed atomics in Clang for this to really be guaranteed to work correctly.

That patch was about lowering atomics in LLVM IR to PTX's volatile which is clearly documented to work. This has nothing to do with the volatile modifier in C/C++ which does not get translated into atomics in LLVM IR by default AFAIK (at least not without switches for compatibility with MS). I'm not sure about the semantics in CUDA and it may not be documented at all (as others have noted). I'd probably agree with what @__simt__ wrote about this.

I also agree that clang in Cuda mode should translate it to a relaxed atomic, if it does not already. Otherwise, it means the compiler incorrectly handles volatile from the CUDA point of view.

tra added a comment.Jun 13 2019, 9:14 AM

@reames , @tra , @Hahnfeld , @jlebar , @chandlerc, I see that this was discussed in D50391 (in terms of using PTX's volatile to provide atomic lowering), but it's not clear to me that using volatile with atomic semantics in LLVM is something we guarantee will work (even in CUDA mode). I imagine that we'd need to lower these in terms of relaxed atomics in Clang for this to really be guaranteed to work correctly.

Do I understand it correctly that the argument is about using C++ volatile with the assumption that those will map to ld.volatile/st.volatile, which happen to provide sufficiently strong guarantees to be equivalent of LLVM's atomic monotonic operations?

If that's the case, then I would agree that it's an implementation detail one should not be relying on. If atomicity is needed, we should figure out a way to express that in C++.

In practice, the standard C++ library support in cuda is somewhat limited. I don't think atomic<> works, so volatile might be a 'happens to work' short-term workaround. If that's the case, then there should be a comment describing what's going on here and a TODO to fix it when better support for atomics on GPU is available.

Another option would be to use atomic*() functions provided by CUDA, but those do have limited functionality on older GPUs.

Yet another alternative is to explicitly use ld.volatile/st.volatile from inline assembly. I don't think we have it plumbed as clang builtin.

Artem, thanks for your comment. But we need not only atomicity, but also we need to command the ptxas compiler to not optimize accesses to parallelLevel. According to several comments (like this one https://stackoverflow.com/a/1533115), it is recommended to use the volatile modifier in CUDA in such cases. If clang does not provide the required level of support for CUDA volatile, I think this is an incompatibility with nvcc.
Also, I already thought about using PTX instructions directly. Probably, you're right that it would be better to use them if you're sure that there is a difference between nvcc and clang.

The quoted stack overflow comment doesn't seem like the one you intended. Seems like a mispaste.

When you say "we need to command the ptxas compiler to not optimize accesses to parallelLevel", this is not the best way to describe what you actually need here. I think you're channeling the belief that compilers don't optimize accesses to std::atomic<T>, and explaining what you need in terms of that belief. It seems that any optimization that could be performed (and some are indeed performed!) on std::atomic<T> with memory_order_relaxed is valid to perform on the accesses to parallelLevel as well, because that is really what it should be.

(If that is not the case, then now would be a good time to say it.)

The next conclusion is completely correct: you get that very semantic from the ptxas compiler by using either *.relaxed.sys or *.volatile accesses. Both have the same meaning as accesses to std::atomic<T> with memory_order_relaxed in modern PTX, whilst nothing in modern PTX carries a true "no optimizations" meaning.

Lastly, I can confirm that Clang definitely supports inline PTX asm, I've used it, with these instructions specifically.

Yes, sorry, wrong link. This is the right one: https://stackoverflow.com/questions/15331009/when-to-use-volatile-with-shared-cuda-memory
No, I'm not relying on the non-optimization of atomics. I need both volatile semantics and atomicity, so that the compiler cannot optimize out the memory accesses and the accesses are atomic.

Yes, I know that clang supports inline PTX asm. Probably, I can use them directly to get what I need with both nvcc and clang for sure.

tra added a comment.Jun 13 2019, 10:06 AM
Artem, thanks for your comment. But we need not only atomicity, but also we need to command the ptxas compiler to not optimize accesses to parallelLevel.

So you need to have ld.volatile/st.volatile to make sure the compiler does not mess with accesses to shared memory.
C++ volatile will give you that. You also rely on atomicity. C++ volatile does not guarantee that, even if NVPTX does happen to. It's a mere coincidence. What if the next PTX revision provides a better way to implement C++ volatile without providing atomicity? Then your code will all of a sudden break in new and exciting ways.

I guess the conservative approach here would be to declare the variable volatile, and then use inline asm ld.volatile/st.volatile to increment it. This will guarantee that the code does exactly what you want without assuming that LLVM back-end always lowers volatile to ld/st.volatile.

According to several comments (like this one https://stackoverflow.com/a/1533115), it is recommended to use the volatile modifier in CUDA in such cases. If clang does not provide the required level of support for CUDA volatile, I think this is an incompatibility with nvcc.

Wrong SO link? The question does not seem to be CUDA-related.

As for the advice to use volatile with nvcc, in my experience it tends to be overused a lot and it's quite often wrong. We used to have *tons* of that in CUDA code in TensorFlow and most of those cases turned out to be a big crowbar shoved in the way of compiler optimizations in order to cover up a real issue somewhere else.

Also, I already thought about using PTX instructions directly. Probably, you're right that it would be better to use them if you're sure that there is a difference between nvcc and clang.

AFAICT, clang will produce PTX .volatile for C++ volatile variables. That's the closest we can get to implementing C++ semantics on the GPU. The fact that it may end up providing stronger guarantees should not be relied on, IMO. It will not be obvious to whoever is going to read the code later. And the LLVM back-end is not guaranteed to always lower C++ volatile as PTX .volatile.

No, I'm not relying on the non-optimization of atomics. I need both volatile semantics and atomicity, so that the compiler cannot optimize out the memory accesses and the accesses are atomic.

I'm going to try to improve the precision on that statement, if you'll allow me.

Since the address at &parallelLevel is necessarily backed by real memory (I mean RAM, and not PCI registers) because it is an instance introduced by C++ code, then many optimizations are not observable regardless of whatever decorations are applied to the declaration. For example x = *ptr; y = *ptr; can always be simplified to x = *ptr; y = x; when you know that ptr points to memory (and some other conditions, like the set of all y = ... simplified is finite). There is a long list of such simplifications based on "run in finite isolation" substitutions (*ptr = x; y = *ptr; => *ptr = x; y = x;... etc...).

To be clear: there is nothing you can do to prevent ptxas from performing optimizations such as these, even on accesses marked volatile, so the value of the predicate "could not optimize memory accesses" is just hardcoded to false, full stop.

The specific optimizations you don't want, are the optimizations that aren't valid on std::atomic<T> with memory_order_relaxed, for example performing substitutions across fences, barriers, and in potentially infinite loops. And again, the final conclusion is completely valid, you disable those optimizations with *.relaxed.sys or *.volatile in PTX. You just don't disable all optimizations.

In the end the mental model is simple: think as though you had std::atomic<T> available to you, the conclusions you draw and the code you write with that will be correct.
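To make that mental model concrete, a host-C++ sketch (illustrative only, not the runtime code): reason about the per-warp counter as if it were a relaxed std::atomic:

#include <atomic>

std::atomic<unsigned char> level{0};

// One "lane" publishes an update; others read it. Relaxed ordering gives
// atomicity and visibility, but no ordering with surrounding accesses.
void writerLane()          { level.fetch_add(1, std::memory_order_relaxed); }
unsigned char readerLane() { return level.load(std::memory_order_relaxed); }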

In D62393#1542042, @tra wrote:
So you need to have ld.volatile/st.volatile to make sure the compiler does not mess with accesses to shared memory.
C++ volatile will give you that. You also rely on atomicity. C++ volatile does not guarantee that, even if NVPTX does happen to. It's a mere coincidence. What if the next PTX revision provides a better way to implement C++ volatile without providing atomicity? Then your code will all of a sudden break in new and exciting ways.

I guess the conservative approach here would be to declare the variable volatile, and then use inline asm ld.volatile/st.volatile to increment it. This will guarantee that the code does exactly what you want without assuming that LLVM back-end always lowers volatile to ld/st.volatile.

Yes, this is what I'm going to do.
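A sketch (untested) of that conservative approach: keep the volatile declaration and route the increment through explicit ld.volatile/st.volatile inline asm, so the behavior does not depend on how the back-end happens to lower C++ volatile. The function name and the u32 width are assumptions for the example:

__device__ void incParallelLevelVolatile(unsigned *Ptr) {
  unsigned V;
  // Volatile load: ptxas must perform an actual memory read.
  asm volatile("ld.volatile.u32 %0, [%1];" : "=r"(V) : "l"(Ptr) : "memory");
  ++V;
  // Volatile store: ptxas must perform an actual memory write.
  asm volatile("st.volatile.u32 [%0], %1;" :: "l"(Ptr), "r"(V) : "memory");
}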

According to several comments (like this one https://stackoverflow.com/a/1533115), it is recommended to use the volatile modifier in CUDA in such cases. If clang does not provide the required level of support for CUDA volatile, I think this is an incompatibility with nvcc.

Wrong SO link? The question does not seem to be CUDA-related.

As for the advice to use volatile with nvcc, in my experience it tends to be overused a lot and it's quite often wrong. We used to have *tons* of that in CUDA code in TensorFlow and most of those cases turned out to be a big crowbar shoved in the way of compiler optimizations in order to cover up a real issue somewhere else.

I know; I also fixed a couple of bugs here that were masked by volatile, but not in this particular case.

Also, I already thought about using PTX instructions directly. Probably, you're right that it would be better to use them if you're sure that there is a difference between nvcc and clang.

AFAICT, clang will produce PTX .volatile for C++ volatile variables. That's the closest we can get to implementing C++ semantics on the GPU. The fact that it may end up providing stronger guarantees should not be relied on, IMO. It will not be obvious to whoever is going to read the code later. And the LLVM back-end is not guaranteed to always lower C++ volatile as PTX .volatile.

Yes, I checked this and relied on the fact that clang generates .volatile kinds of ops.
I know it is not guaranteed. Anyway, we could use conditional compilation in case of any changes in future versions. But PTX asm instructions are the best option here, I think.

In D62393#1542042, @tra wrote:

C++ volatile will give you that. You also rely on atomicity. C++ volatile does not guarantee that, even if NVPTX does happen to. It's a mere coincidence. What if the next PTX revision provides a better way to implement C++ volatile without providing atomicity? Then your code will all of a sudden break in new and exciting ways.

I think it's completely healthy and fair to be skeptical that we'll do the right thing by you, but FWIW the changes I want to make to CUDA volatile after we ship std::atomic<T> go in the other direction. And the legacy requirements are pretty entrenched here, too.

Yes, this is what I need and what I'm trying to explain. Thanks, Olivier, you did it better than me.

Yes, there are outstanding technical questions; from @jdoerfert :

I'm also still not following what race condition you try to prevent with volatile in the first place; could you elaborate which accesses are racing without it?

and a related question regarding alternate synchronization schemes (my suspicion is that it's more useful to answer the question about which accesses are racing first, so that the reviewers can assess what ordering semantics are required, and then address any benefits or drawbacks of alternate synchronization schemes).

@jdoerfert also asked:

I asked where we document that a single array encodes both, the number of active and inactive parallel regions, at the same time. The code is not sufficient documentation for such a low-level implementation detail. Similar, but before not on my radar, is the fact that there is an apparently undocumented implementation detail wrt the number of levels we can handle.

And either the requested documentation should be added, or if it already exists, you should point that out.

Your reply to @Hahnfeld:

... 1 is returned if at least 1 parallel region is active. And we count both, active and inactive regions. We may have only 1 active region, the very first one. If the first region is active, the MSB is set to 1 to mark this situation.

this should also appear in a comment in the code.

In general, please treat all reviewer comments as comments from experts that should be addressed (and, generally, code reviewers do have relatively-rare expertise in particular areas). Experts are also busy, however, so we all try to be respectful of each other's time. Experts also make mistakes and suboptimal decisions, of course, which is why we have code review. Thus, if someone asks for an explanation of how something works in the current or updated code, then provide an explanation. If someone requests additional documentation, then add it. Yes, the expert can read the code, but it's more efficient for you to provide the explanation (as the author of the patch, you have presumably read the code most recently) or documentation.

Again, I see no reason to believe that everyone here isn't acting in good faith and working to create the software of the highest possible quality. Thanks!

tra added a comment.Jun 13 2019, 10:42 AM
In D62393#1542042, @tra wrote:

C++ volatile will give you that. You also rely on atomicity. C++ volatile does not guarantee that, even if NVPTX does happen to. It's a mere coincidence. What if the next PTX revision provides a better way to implement C++ volatile without providing atomicity? Then your code will all of a sudden break in new and exciting ways.

I think it's completely healthy and fair to be skeptical that we'll do the right thing by you, but FWIW the changes I want to make to CUDA volatile after we ship std::atomic<T> go in the other direction. And the legacy requirements are pretty entrenched here, too.

I didn't mean this to be a critical statement. I apologize if it came out as such. The example was purely hypothetical, to illustrate the point that one should not assume that a C++ qualifier will always do what the C++ standard does not guarantee.

Yes, there are outstanding technical questions; from @jdoerfert :

I'm also still not following what race condition you try to prevent with volatile in the first place; could you elaborate which accesses are racing without it?

Several threads in the same warp access the same parallelLevel array element. The compiler optimizes out some of the accesses (because it is not aware of the accesses from several threads), and this leads to undefined behaviour.

and a related question regarding alternate synchronization schemes (my suspicion is that it's more useful to answer the question about which accesses are racing first, so that the reviewers can assess what ordering semantics are required, and then address any benefits or drawbacks of alternate synchronization schemes).

@jdoerfert also asked:

I asked where we document that a single array encodes both, the number of active and inactive parallel regions, at the same time. The code is not sufficient documentation for such a low-level implementation detail. Similar, but before not on my radar, is the fact that there is an apparently undocumented implementation detail wrt the number of levels we can handle.

And either the requested documentation should be added, or if it already exists, you should point that out.

I agree that documentation should be added; I already said this. But it is not directly related to this patch.

Your reply to @Hahnfeld:

... 1 is returned if at least 1 parallel region is active. And we count both, active and inactive regions. We may have only 1 active region, the very first one. If the first region is active, the MSB is set to 1 to mark this situation.

this should also appear in a comment in the code.

In general, please treat all reviewer comments as comments from experts that should be addressed (and, generally, code reviewers do have relatively-rare expertise in particular areas). Experts are also busy, however, so we all try to be respectful of each other's time. Experts also make mistakes and suboptimal decisions, of course, which is why we have code review. Thus, if someone asks for an explanation of how something works in the current or updated code, then provide an explanation. If someone requests additional documentation, then add it. Yes, the expert can read the code, but it's more efficient for you to provide the explanation (as the author of the patch, you have presumably read the code most recently) or documentation.

Yes, I agree, and I can answer questions about the technical details, if they are productive and lead to the best code quality.

Again, I see no reason to believe that everyone here isn't acting in good faith and working to create the software of the highest possible quality. Thanks!

Sorry, but Johannes did this. "The more patches go through with this kind of "review" the more "suspicious" this looks to me." Those are his words. It seems to me he does not think that I or my colleagues are acting in good faith.

reames removed a subscriber: reames.Jun 13 2019, 11:33 AM

...

Again, I see no reason to believe that everyone here isn't acting in good faith and working to create the software of the highest possible quality. Thanks!

Sorry, but Johannes did this. "The more patches go through with this kind of "review" the more "suspicious" this looks to me." Those are his words. It seems to me he does not think that I or my colleagues are acting in good faith.

Okay, I understand why you say this, but no. Frankly, Johannes was implying that the reviews were too lax (e.g., allowing combined patches with insufficient tests/documentation/etc.). I think that all of these issues have now been covered here (and on D62199, etc.), and so I won't repeat them again. The reviewers (e.g., @grokos for the case of D62199) almost certainly believed that they were being helpful and moving the development process forward. The fact that the patches did not all follow our best practices, etc. makes Johannes "suspicious" that the patches could have reasonably been improved before being committed. It makes me suspicious of that too. But no one here believes that anyone was trying to subvert the system and produce an inferior result - it is very likely that everyone had, and continues to have, the best of intentions. And, I'll point out, that we have indeed accomplished a lot in terms of OpenMP feature enablement, etc. At this point, we have more eyes on the process and we should all work together to produce an even-better result. The glass, in this case, really is half full.

I want to investigate the racy accesses further and make sure it is not a miscompile inside LLVM.
I extracted the test case (see below) but I was not seeing the ERROR. How did you run the test case to see a different value for Count?

@ye-luo ran it with Clang-ykt (2019-05-27) on a Quadro P1000 and I ran Clang (Jun 13 15:24:11 2019 : 3bc6e2a7aa3853b06045c42e81af094647c48676) on a V100 so far.

#include <stdio.h>

int main() {
  for (int i = 0; i < 1000; ++i) {
    int Count = 0;
#pragma omp target parallel for reduction(+: Count) schedule(dynamic, 2) num_threads(64)
    for (int J = 0; J < 1000; ++J) {
      Count += J;
    }
    // Expected result: 0 + 1 + ... + 999 == 1000 * 999 / 2 == 499500.
    if (Count != 499500)
      printf("ERROR [@%i] %i\n", i, Count);
  }
  return 0;
}

...

Again, I see no reason to believe that everyone here isn't acting in good faith and working to create the software of the highest possible quality. Thanks!

Sorry, but Johannes did this. "The more patches go through with this kind of "review" the more "suspicious" this looks to me." Those are his words. It seems to me he does not think that I or my colleagues are acting in good faith.

Okay, I understand why you say this, but no. Frankly, Johannes was implying that the reviews were too lax (e.g., allowing combined patches with insufficient tests/documentation/etc.).

Only one patch was combined. Tests were added to all the functional patches as soon as I discovered that the testing infrastructure for libomptarget was already configured. And I myself insist on adding test cases for the functional patches; you can ask Doru about this.
As for the documentation, yes, we lack documentation for libomptarget. This is a long story and it requires special work. I'm also not very happy with the current situation. But the documentation is a separate piece of work. It requires a special set of patches with the design decisions, etc. Most of it follows the design of the original libomp, but has some NVPTX-specific parts.

I think that all of these issues have now been covered here (and on D62199, etc.), and so I won't repeat them again. The reviewers (e.g., @grokos for the case of D62199) almost certainly believed that they were being helpful and moving the development process forward.

The fact that the patches did not all follow our best practices, etc. makes Johannes "suspicious" that the patches could have reasonably been improved before being committed.

Which ones do not follow "best practices"? The combined one? That is only one patch. The NFC patches? They are refactoring patches and we already have the tests for them. He did not try to understand the idea of the patches; instead he decided to treat them as "suspicious" without any comments.

It makes me suspicious of that too. But no one here believes that anyone was trying to subvert the system and produce an inferior result - it is very likely that everyone had, and continues to have, the best of intentions.

Maybe, just maybe, before treating someone's activity as "suspicious", it would be better to try to understand something first? To read something, to ask questions in the proper manner, etc.?

I want to investigate the racy accesses further and make sure it is not a miscompile inside LLVM.

This is not a problem inside LLVM. The problem appears after optimizations performed by the ptxas tool (when it compiles PTX to SASS) at O3 with the inlined runtime.
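
To illustrate the class of optimization involved, here is a minimal sketch (hypothetical names, not the actual runtime code):

#include <stdint.h>

__device__ __shared__ uint8_t level[32]; // one counter per warp

__device__ void waitForLevelDrop(unsigned wid) {
  // Without volatile, ptxas may load level[wid] once, cache it in a register,
  // and reuse the stale value on every iteration, even though a divergent
  // thread of the same block stores a new value in the meantime - the loop
  // then never exits.
  while (level[wid] != 0)
    ;
}

// Declaring the array volatile (or using ld.volatile/st.volatile) forces an
// actual memory instruction for every access:
//   __device__ __shared__ volatile uint8_t level[32];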

I extracted the test case (see below) but I was not seeing the ERROR. How did you run the test case to see a different value for Count?

You need to compile it with the inlined runtime at O2 or O3. And you need the latest version of libomptarget; the clang-ykt version is implemented differently and does not have this problem.

It makes me suspicious of that too. But no one here believes that anyone was trying to subvert the system and produce an inferior result - it is very likely that everyone had, and continues to have, the best of intentions.

Maybe, just maybe, before starting treat someone's activity "suspicious" better to start to try to understand something? To read something, to ask the questions in the proper manner, etc.?

I did ask questions [0,1] and read *a lot* [2,3].

Here, and in D62199, my comments resulted in:

  • Complex technical discussions with commitments towards improving the patches and review process ([4] to name one), and
  • Various comments and accusations that people found inappropriate ([1,5-11] to restrict it to these patches only). The one you refer to is in [1], and in my follow-up [12] I agreed that my words were not well chosen.

[0] https://reviews.llvm.org/D62199#1512617
[1] https://reviews.llvm.org/D62393#1528421
[2] https://reviews.llvm.org/D62199#1515182
[3] https://reviews.llvm.org/D62199#1517073

[4] https://reviews.llvm.org/D62199#1515072

[5] https://reviews.llvm.org/D62199#1512638
[6] https://reviews.llvm.org/D62199#1513126
[7] https://reviews.llvm.org/D62199#1514027
[8] https://reviews.llvm.org/D62393#1533461
[9] https://reviews.llvm.org/D62393#1539086
[10] https://reviews.llvm.org/D62393#1539630
[11] https://reviews.llvm.org/D62393#1542505

[12] https://reviews.llvm.org/D62199#1515182

I did ask questions [0,1] and read *a lot* [2,3].

Here, and in D62199, my comments resulted in:

[0] was enough for me to treat all your subsequent actions in a certain light. Your questions do not mean anything; you had already made your decision, and all the questions were just there to confirm that first judgement.

Only one patch was combined.

I'm not sure that this is true. D55379 was combined as well. There might be others. I don't think that it's useful to conduct an audit in this regard. The important point is that there were a number of issues over time; some have been addressed, and others we're addressing right now.

Tests were added to all the functional patches as soon as I discovered that the testing infrastructure for libomptarget was already configured. And I myself insist on adding test cases for functional patches; you can ask Doru about this.

I'm aware of a lot of the history (and I recall being involved in the discussion in D60578 regarding the tests). Nevertheless, we've somehow ended up with a library without sufficient documentation, and robust code reviews should have prevented that. It's not useful to assign blame, but it is something of which we should all be aware so that we can actively improve the situation going forward.

As for the documentation, yes, we lack documentation for libomptarget. This is a long story and it requires special work. I'm also not very happy with the current situation. But the documentation is separate work. It requires a special set of patches with the design decisions, etc. Most of it follows the design of the original libomp, but has some NVPTX-specific parts.

Yes, we should have documentation covering the design decisions and other details, but I don't believe that treating this as a special task is productive. When a subsystem lacks good documentation, then we should improve that as we make changes to the code - just because the documentation is poor is not a good reason for not adding comments in the code now. It is, moreover, an even stronger argument for adding comments in the code now. Please do add whatever documentation is possible and associated with this change now.

The fact that the patches did not all follow our best practices, etc. makes Johannes "suspicious" that the patches could have reasonably been improved before being committed.

Which ones do not follow "best practices"? The combined one? That is only one patch. The NFC patches? They are refactoring patches, and we already have tests for them.

Again, it is not just one patch, and there have been a variety of issues. Some have already been addressed and others we're addressing here. The important point is to make sure that things improve over time, and that's exactly what we can do now.

He did not try to understand the idea of the patches; instead he decided to treat them as "suspicious" without any comments.

This statement is unproductive, and likely false. Johannes has spent a lot of time looking at the past patches and the code. If he sees a systematic problem, then there probably is one.

Maybe, just maybe, before starting to treat someone's activity as "suspicious", it would be better to try to understand it first? To read something, to ask questions in the proper manner, etc.?

Again, this is unproductive, and false. I agree that Johannes could have been more polite at times - frustration can be stated directly, and that's almost always more productive - but let's be fair: there's a significant corpus of text where you're rude, dismissive, abrasive, and so on. I can provide links if you'd like, but I think you know very well that it's true. That should stop. Nevertheless, you expect everyone else to assume that you're working in good faith, have fairly considered and addressed their feedback, and so on. I suggest that you extend to Johannes, in this case, the same courtesy you expect others to extend to you.

You need to compile it with the inlined runtime at O2 or O3.

When I run
./bin/clang -fopenmp-targets=nvptx64-nvidia-cuda -O3 -fopenmp --cuda-path=/soft/compilers/cuda/cuda-9.1.85 -Xopenmp-target -march=sm_70 -fopenmp=libomp test.c -o test.ll -emit-llvm -S
I get

https://gist.github.com/jdoerfert/4376a251d98171326d625f2fb67b5259

which shows the inlined and optimized libomptarget.

And you need the latest version of the libomptarget

My version is from today Jun 13 15:24:11 2019, git: 3bc6e2a7aa3853b06045c42e81af094647c48676

My version is from today Jun 13 15:24:11 2019, git: 3bc6e2a7aa3853b06045c42e81af094647c48676

We have problems in Cuda 8, at least, for arch sm_35

I'm not sure that this is true. D55379 was combined as well. There might be others. I don't think that it's useful to conduct an audit in this regard. The important point is that there were a number of issues over time; some have been addressed, and others we're addressing right now.

OK, you found another one, committed half a year ago. Do you really think you can call it systematic and suspicious?

I'm aware of a lot of the history (and I recall being involved in the discussion in D60578 regarding the tests). Nevertheless, we've somehow ended up with a library without sufficient documentation, and robust code reviews should have prevented that. It's not useful to assign blame, but it is something of which we should all be aware so that we can actively improve the situation going forward.

I already said that we need documentation. But I can't do everything myself. Plus, I'm not the original developer of the library; I'm just doing my best to improve it. Maybe you or Johannes could help with this and prepare the documentation? That would be a great help!

Yes, we should have documentation covering the design decisions and other details, but I don't believe that treating this as a special task is productive. When a subsystem lacks good documentation, then we should improve that as we make changes to the code - just because the documentation is poor is not a good reason for not adding comments in the code now. It is, moreover, an even stronger argument for adding comments in the code now. Please do add whatever documentation is possible and associated with this change now.

We need a base from which we can start. Preparing a base document that can be modified/upgraded in subsequent patches is separate work.

This statement is unproductive, and likely false. Johannes has spent a lot of time looking at the past patches and the code. If he sees a systematic problem, then there probably is one.

Let me disagree with you.

Again, this is unproductive, and false. I agree that Johannes could have been more polite at times - frustration can be stated directly, and that's almost always more productive - but let's be fair: there's a significant corpus of text where you're rude, dismissive, abrasive, and so on. I can provide links if you'd like, but I think you know very well that it's true. That should stop. Nevertheless, you expect everyone else to assume that you're working in good faith, have fairly considered and addressed their feedback, and so on. I suggest that you extend to Johannes, in this case, the same courtesy you expect others to extend to you.

Sure, I was rude. When somebody blames me without any proof, I can be rude. I'm sorry for this! But it was caused by the unconstructive initial comment.

hfinkel added a comment.EditedJun 13 2019, 6:06 PM

OK, you found another one, committed half a year ago. Do you really think you can call it systematic and suspicious?

If there is a systematic problem, then this would be one part among several. Regardless, we both agree that this is something to be avoided, so as far as I'm concerned, this problem has been addressed.

I already said that we need documentation. But I can't do everything myself. Plus, I'm not the original developer of the library; I'm just doing my best to improve it. Maybe you or Johannes could help with this and prepare the documentation? That would be a great help!

Johannes and I can help with the documentation.

We need a base from which we can start. Preparing a base document that can be modified/upgraded in subsequent patches is separate work.

The lack of existing overall documentation does not prevent adding a comment explaining this particular variable, what it stores and why it is marked as volatile, now.

Sure, I was rude. When somebody blames me without any proof, I can be rude. I'm sorry for this! But it was caused by the unconstructive initial comment.

At this point, I feel as though everyone here has voiced their concerns about past practices and behavior. We understand how to address each other's concerns going forward and maintain professional discourse. We should do so. I highly recommend that, at this point, we avoid future comments that a reader is likely to view as rude, condescending, dismissive, or confrontational. Any suboptimal code, documentation, etc. in previous patches can be improved by future patches, and that should be our focus going forward.

Johannes and I can help with the documentation.

Good!

The lack of existing overall documentation does not prevent adding a comment explaining this particular variable, what it stores and why it is marked as volatile, now.

Sure, I'll add a comment for a variable describing its purpose and why it must be marked volatile.
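
Presumably something along these lines (a sketch; the array-size macros MAX_THREADS_PER_TEAM and WARPSIZE are my recollection of the deviceRTL names, to be checked against the headers):

// parallelLevel tracks, per warp, the nesting depth of parallel regions; the
// OMP_ACTIVE_PARALLEL_LEVEL bit marks the one level that has more than one
// thread. It is read and written by divergent threads and mixed with atomics,
// so every access must be a real memory operation: volatile keeps ptxas from
// caching the value in a register across divergent code.
volatile __device__ __shared__ uint8_t
    parallelLevel[MAX_THREADS_PER_TEAM / WARPSIZE];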

We have problems in Cuda 8, at least, for arch sm_35

I couldn't get that version to run properly so I asked someone who had a system set up.
Unfortunately, the test.c [1] did not trigger the problem. In test.c we run the new test part in spmd_parallel_regions.cpp 1000 times and check the result each time.
It was run with Cuda 8.0 for sm_35, sm_37, and sm_70.

Could you share more information on how the system has to look to trigger the problem?
Could you take a look at the test case we run and make sure it triggers the problem on your end?

[1] https://gist.github.com/jdoerfert/d2b18ca8bb5c3443cc1d26b23236866f

Could you share more information on how the system has to look to trigger the problem?
Could you take a look at the test case we run and make sure it triggers the problem on your end?

Will provide additional info on Tuesday. Most probably, this simplified test does not trigger the problem in your configuration. Will send the original complex test that triggers the problem.

You need to apply the patch D62318 to reproduce the problem for sure.

You need to apply the patch D62318 to reproduce the problem for sure.

This means the problem, as of right now, does not exist, correct?
If so, what part of the D62318 patch is causing the problem?

Does the test.c that I floated earlier expose the problem then or do I need a different test case?
What configuration are you running? Is it reproducible with Cuda 9/10 and sm_70?

This means the problem, as of right now, does not exist, correct?

No, it still might appear, but it is rather harder to run into trouble with the current version of the runtime.

If so, what part of the D62318 patch is causing the problem?

I significantly reduced the size of the runtime class, and that triggers some of the optimizations more often. The access to the parallelLevel variable, when we check the current parallel level to get the correct thread ID, triggers those optimizations.
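
Schematically, the problematic access has this shape (a hypothetical sketch, not the exact supporti.h logic):

INLINE int GetOmpThreadId(int threadId) {
  // Once the (now much smaller) runtime class is fully inlined at O3, ptxas
  // may keep this parallelLevel read in a register across the surrounding
  // divergent code, so the thread-ID computation sees a stale level.
  if (parallelLevel[GetWarpId()] & OMP_ACTIVE_PARALLEL_LEVEL)
    return threadId; // active, multi-threaded region
  return 0;          // serialized nested region: single thread with ID 0
}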

Does the test.c that I floated earlier expose the problem then or do I need a different test case?
What configuration are you running? Is it reproducible with Cuda 9/10 and sm_70?

Yes, it exposes the problem, but only with D62318 applied. Not sure about Cuda9, will try to check this later today.

Checked with Cuda9: it works. Most probably the problem is related to Cuda8 only; there were likely some optimizations (in the ptxas tool) that were fixed in Cuda9. Clang defines the macro CUDA_VERSION and sets it to 8000 for Cuda8. I can check this macro and use volatile modifiers only for Cuda8.
To reproduce the problem, you need to build the runtime without debug info and build the test at O3.
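
For instance (a sketch; the qualifier macro name is made up):

#if defined(CUDA_VERSION) && CUDA_VERSION < 9000
// Cuda8's ptxas caches shared-memory reads in registers across divergent
// code; force a real memory access for every reference.
#define PAR_LEVEL_QUALIFIER volatile
#else
#define PAR_LEVEL_QUALIFIER
#endif

PAR_LEVEL_QUALIFIER __device__ __shared__ uint8_t
    parallelLevel[MAX_THREADS_PER_TEAM / WARPSIZE];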

ABataev updated this revision to Diff 206671.Jun 26 2019, 7:54 AM

Use inline assembler to implement volatile load/stores for parallelLevel.

tra added inline comments.Jun 26 2019, 8:50 AM
libomptarget/deviceRTLs/nvptx/src/supporti.h
97–101

Perhaps it can be deduped into something like this:

#ifdef __LP64__
#define PTR_CONSTRAINT               "l"
#else  // __LP64__
#define PTR_CONSTRAINT               "r"
#endif // __LP64__

... 
asm volatile("st.volatile.u8 [%1], %0;" :: "r"(Val), PTR_CONSTRAINT(&parLevel)
             : "memory");
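
The matching load could presumably reuse the same macro, e.g.:

asm volatile("ld.volatile.u8 %0, [%1];"
             : "=r"(Val)
             : PTR_CONSTRAINT(&parLevel)
             : "memory");
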
ABataev marked an inline comment as done.Jun 26 2019, 8:54 AM
ABataev added inline comments.
libomptarget/deviceRTLs/nvptx/src/supporti.h
97–101

Sure, thanks.

ABataev updated this revision to Diff 206693.Jun 26 2019, 9:01 AM

Simplified constraint.

Thanks for the explicit asm. One of the hazards of compiling cuda for amdgcn is that volatile doesn't imply atomic, so this is a clear warning that I'll have a bug on merge.

libomptarget/deviceRTLs/nvptx/src/supporti.h
121

Could this function call getParallelLevel, increment/decrement the result, then call setParallelLevel?

ABataev updated this revision to Diff 219739.Wed, Sep 11, 9:53 AM

Used getParallelLevel/setParallelLevel instead of explicit asm instructions in changeParallelLevel.
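
A sketch of what that presumably looks like (the GetWarpId() helper and the exact names are assumptions, not the committed code):

INLINE unsigned getParallelLevel() {
  unsigned Val;
  asm volatile("ld.volatile.u8 %0, [%1];"
               : "=r"(Val)
               : PTR_CONSTRAINT(&parallelLevel[GetWarpId()])
               : "memory");
  return Val;
}

INLINE void setParallelLevel(unsigned Val) {
  asm volatile("st.volatile.u8 [%1], %0;"
               :: "r"(Val), PTR_CONSTRAINT(&parallelLevel[GetWarpId()])
               : "memory");
}

INLINE void changeParallelLevel(int Delta) {
  // Every read and write goes through the volatile accessors, so no explicit
  // asm is needed here and the warp-local counter is re-read on each change.
  setParallelLevel(getParallelLevel() + Delta);
}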

ABataev marked an inline comment as done.Wed, Sep 11, 9:53 AM