This is an archive of the discontinued LLVM Phabricator instance.

[OPENMP][NVPTX]Mark parallel level counter as volatile.
Abandoned · Public

Authored by ABataev on May 24 2019, 7:47 AM.

Details

Summary

The parallel level counter is accessed simultaneously by many threads. In combination with atomic operations used in full runtime SPMD mode (triggered by dynamic scheduling, for example) it may lead to incorrect results caused by compiler optimizations. According to the CUDA Toolkit documentation (https://docs.nvidia.com/cuda/archive/8.0/cuda-c-programming-guide/index.html#volatile-qualifier):

The compiler is free to optimize reads and writes to global or shared memory (for example, by caching global reads into registers or L1 cache) as long as it respects the memory ordering semantics of memory fence functions and memory visibility semantics of synchronization functions.

These optimizations can be disabled using the volatile keyword: If a variable located in global or shared memory is declared as volatile, the compiler assumes that its value can be changed or used at any time by another thread and therefore any reference to this variable compiles to an actual memory read or write instruction.

This is especially important in the case of thread divergence mixed with atomic operations. In our case, this may lead to undefined behavior or thread deadlock, especially on CUDA 9 and later. This change is required in particular for SPMD mode with full runtime, and for SPMD mode without full runtime but with atomic operations.

Event Timeline

ABataev created this revision. May 24 2019, 7:47 AM
Herald added a project: Restricted Project. May 24 2019, 7:47 AM
grokos accepted this revision. Jun 3 2019, 5:27 PM
grokos edited the summary of this revision. (Show Details)

I formatted the description a bit. The patch looks good and the reasoning behind it is now obvious, in line with Nvidia's documentation.

This revision is now accepted and ready to land. Jun 3 2019, 5:29 PM

I still do not see why volatile is the right solution here:
In D61395 the way the level was tracked was changed but I failed to see how exactly. We have to start there.

  • What is the constant OMP_ACTIVE_PARALLEL_LEVEL doing exactly?
  • Probably even more fundamental, what is encoded in the parallelLevel array? It is not (only) the parallel level, correct? It is per warp, correct?

This is especially important in case of thread divergence mixed with atomic operations.

We do not use atomics on the parallelLevel do we?
On which level is the thread divergence, in a warp?

And another fundamental question:
What is the connection of the test to the change? It looks totally unrelated as it does not check the parallel level at all.

jdoerfert requested changes to this revision. Jun 3 2019, 5:44 PM

This is not obvious to me.

This revision now requires changes to proceed. Jun 3 2019, 5:44 PM

I still do not see why volatile is the right solution here:
In D61395 the way the level was tracked was changed but I failed to see how exactly. We have to start there.

  • What is the constant OMP_ACTIVE_PARALLEL_LEVEL doing exactly?

It marks whether the parallel region has more than one thread. Only the very first parallel region may have more than one thread. It is required, at least, to correctly implement the omp_in_parallel function.

  • Probably even more fundamental, what is encoded in the parallelLevel array? It is not (only) the parallel level, correct? It is per warp, correct?

It is only the parallel level, and yes, it is per warp, required to handle L2+ parallelism. Because it is per warp, it is required to be volatile: when combined with atomic operations and fully inlined runtime functions, the memory ordering is otherwise not preserved, and that leads to undefined behavior in full runtime mode, which uses atomic operations.
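
To make the encoding discussed here concrete, the following host-side C++ sketch models a per-warp counter that packs both the nesting depth and an "active" flag into one byte. The constant name echoes OMP_ACTIVE_PARALLEL_LEVEL from this thread, but the bit layout and the helper functions are illustrative assumptions, not the actual libomptarget-nvptx code:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical encoding: the high bit of the per-warp counter marks that
// an "active" parallel region (more than one thread) exists somewhere in
// the nest; the low 7 bits hold the nesting depth (hence the limit of 128
// inner levels mentioned later in the thread).
constexpr uint8_t OMP_ACTIVE_PARALLEL_LEVEL = 0x80;

uint8_t enterParallel(uint8_t level, bool active) {
  uint8_t next = level + 1;            // one more nesting level
  if (active)
    next |= OMP_ACTIVE_PARALLEL_LEVEL; // remember that a level is active
  return next;
}

bool inActiveParallel(uint8_t level) {
  // omp_in_parallel() should report true only inside an active region.
  return (level & OMP_ACTIVE_PARALLEL_LEVEL) != 0;
}

uint8_t nestingDepth(uint8_t level) {
  return level & ~OMP_ACTIVE_PARALLEL_LEVEL; // strip the active flag
}
```

This also shows why a single array can carry both "levels" and "active-levels": the flag and the depth occupy disjoint bits of the same value.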

This is especially important in case of thread divergence mixed with atomic operations.

We do not use atomics on the parallelLevel do we?

Atomics are used not for parallelLevel, but for dynamic scheduling and other operations, like SM management, used in full runtime mode.

On which level is the thread divergence, in a warp?

Yes.

And another fundamental question:
What is the connection of the test to the change? It looks totally unrelated as it does not check the parallel level at all.

It demonstrates the problem in SPMD mode + full runtime, which actively uses atomic operations. Dynamic scheduling forces the use of atomics, which exposes the problem.

It marks whether the parallel region has more than one thread. Only the very first parallel region may have more than one thread. It is required, at least, to correctly implement the omp_in_parallel function.

I would have assumed omp_in_parallel to return true if the parallelLevel is not the default (0 or 1). Why isn't that so?

Because it is per warp, it is required to be volatile: when combined with atomic operations and fully inlined runtime functions, the memory ordering is otherwise not preserved, and that leads to undefined behavior in full runtime mode, which uses atomic operations.

The accesses to the parallelLevel array are non-atomic, so it is not surprising that they can behave "unexpectedly" when atomic accesses are present. Why don't we access parallelLevel atomically? Why don't we synchronize the warp after the parallel level is changed? If we want to communicate the updated value to all threads, it seems natural to do that explicitly and not to rely on volatile.

It marks whether the parallel region has more than one thread. Only the very first parallel region may have more than one thread. It is required, at least, to correctly implement the omp_in_parallel function.

I would have assumed omp_in_parallel to return true if the parallelLevel is not the default (0 or 1). Why isn't that so?

According to the standard, it should return 1 only if we are in an active parallel region. An active parallel region is a region with a number of threads greater than 1.

Because it is per warp, it is required to be volatile: when combined with atomic operations and fully inlined runtime functions, the memory ordering is otherwise not preserved, and that leads to undefined behavior in full runtime mode, which uses atomic operations.

The accesses to the parallelLevel array are non-atomic, so it is not surprising that they can behave "unexpectedly" when atomic accesses are present. Why don't we access parallelLevel atomically? Why don't we synchronize the warp after the parallel level is changed? If we want to communicate the updated value to all threads, it seems natural to do that explicitly and not to rely on volatile.

According to the PTX ISA, the volatile modifier has the same semantics as the relaxed modifier, and thus the load/store operations become morally strong. Plus, volatile synchronizes memory accesses on each read/write operation for the variable.
The extra synchronization is not very good, because it slows down the execution of the whole program.
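
For contrast, the explicit-atomics alternative the reviewers raise can be sketched in portable C++ with relaxed std::atomic operations. The real runtime would use CUDA atomics rather than std::atomic, so the names below are illustrative only:

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical per-warp level counter accessed with explicit relaxed
// atomics instead of volatile. Relaxed ordering gives atomicity and
// eventual visibility for this one variable without fencing any other
// memory, i.e. without the whole-program synchronization cost debated
// in this thread.
std::atomic<uint32_t> parallelLevel{0};

void enterParallel() { parallelLevel.fetch_add(1, std::memory_order_relaxed); }
void exitParallel()  { parallelLevel.fetch_sub(1, std::memory_order_relaxed); }

uint32_t currentLevel() {
  return parallelLevel.load(std::memory_order_relaxed);
}
```

The design point is that relaxed atomics confine the cost to the accesses of this single variable, which is what both sides of the discussion seem to want.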

It marks whether the parallel region has more than one thread. Only the very first parallel region may have more than one thread. It is required, at least, to correctly implement the omp_in_parallel function.

I would have assumed omp_in_parallel to return true if the parallelLevel is not the default (0 or 1). Why isn't that so?

According to the standard, it should return 1 only if we are in an active parallel region. An active parallel region is a region with a number of threads greater than 1.

The standard has "levels" and "active-levels". Both need to be tracked but it seems we only have a single array?

Because it is per warp, it is required to be volatile: when combined with atomic operations and fully inlined runtime functions, the memory ordering is otherwise not preserved, and that leads to undefined behavior in full runtime mode, which uses atomic operations.

The accesses to the parallelLevel array are non-atomic, so it is not surprising that they can behave "unexpectedly" when atomic accesses are present. Why don't we access parallelLevel atomically? Why don't we synchronize the warp after the parallel level is changed? If we want to communicate the updated value to all threads, it seems natural to do that explicitly and not to rely on volatile.

According to the PTX ISA, the volatile modifier has the same semantics as the relaxed modifier, and thus the load/store operations become morally strong. Plus, volatile synchronizes memory accesses on each read/write operation for the variable.
The extra synchronization is not very good, because it slows down the execution of the whole program.

I don't get it. You say volatile makes each operation synchronize, but you also say synchronization is not very good. I'm also still not following what race condition you are trying to prevent with volatile in the first place; could you elaborate on which accesses are racing without it?

arsenm added a comment. Jun 6 2019, 4:27 PM

I thought allowing volatile for synchronization was a mistake, and CUDA deprecates doing so, at least since Volta.

I thought allowing volatile for synchronization was a mistake, and CUDA deprecates doing so, at least since Volta.

I don't see any changes in the latest PTX ISA to the semantics of volatile.

The standard has "levels" and "active-levels". Both need to be tracked but it seems we only have a single array?

They are all tracked in this array.

Where is that described?

We don't need to synchronize the whole program, we just need to synchronize accesses to the parallelLevel array. In SPMD mode with full runtime (or SPMD mode with atomic ops) the PTX compiler generates unsafe code for the parallelLevel array unless it is marked as volatile (i.e., a simple atomic-like construct in CUDA).

Nobody argues we should synchronize the whole program. If you want to have atomic accesses to the parallelLevel array, make the accesses to the parallelLevel array atomic.

You cannot just say that the PTX compiler generates unsafe code. That is not an argument, that is a guess.

The standard has "levels" and "active-levels". Both need to be tracked but it seems we only have a single array?

They are all tracked in this array.

Where is that described?

Sorry, read the code. I'm not a debugger.

We don't need to synchronize the whole program, we just need to synchronize accesses to the parallelLevel array. In SPMD mode with full runtime (or SPMD mode with atomic ops) the PTX compiler generates unsafe code for the parallelLevel array unless it is marked as volatile (i.e., a simple atomic-like construct in CUDA).

Nobody argues we should synchronize the whole program. If you want to have atomic accesses to the parallelLevel array, make the accesses to the parallelLevel array atomic.

You cannot just say that the PTX compiler generates unsafe code. That is not an argument, that is a guess.

Volatile provides the required level of atomicity per the PTX ISA.

The standard has "levels" and "active-levels". Both need to be tracked but it seems we only have a single array?

They are all tracked in this array.

Where is that described?

Sorry, read the code. I'm not a debugger.

First of all, this response is simply rude.
Second, you wrote the code so my question is well justified.
Third, does that mean you encode multiple values in a single int array but did not document this at all? The code is not documentation.
Finally, this will break once we have a parallel region in a recursive loop with more than 254 instantiations, correct?

We don't need to synchronize the whole program, we just need to synchronize accesses to the parallelLevel array. In SPMD mode with full runtime (or SPMD mode with atomic ops) the PTX compiler generates unsafe code for the parallelLevel array unless it is marked as volatile (i.e., a simple atomic-like construct in CUDA).

Nobody argues we should synchronize the whole program. If you want to have atomic accesses to the parallelLevel array, make the accesses to the parallelLevel array atomic.

You cannot just say that the PTX compiler generates unsafe code. That is not an argument, that is a guess.

Volatile provides the required level of atomicity per the PTX ISA.

Arguably, atomic accesses provide atomicity. You also never responded to the question of which other accesses actually race with the ones to parallelLevel, or why you do not synchronize the accesses to parallelLevel across the warp explicitly.

The standard has "levels" and "active-levels". Both need to be tracked but it seems we only have a single array?

They are all tracked in this array.

Where is that described?

Sorry, read the code. I'm not a debugger.

First of all, this response is simply rude.
Second, you wrote the code so my question is well justified.
Third, does that mean you encode multiple values in a single int array but did not document this at all? The code is not documentation.
Finally, this will break once we have a parallel region in a recursive loop with more than 254 instantiations, correct?

First of all, you started this when you expressed the thought that the patches are accepted not because they are correct, but because we all work at the same company. That was rude! Besides, we don't work at the same company anymore.
Second, you need to try to understand the main idea yourself. I can't explain the whole protocol between the compiler and the runtime; it would take a lot of time. Try to understand it first and then ask questions about particular details, not the whole scheme. Or do you want me to ask you to explain the principles of the Attributor in full? Or the loop vectorizer? Or something else? Your question is not productive.
Third, the single array is used to handle up to 128 (not 256) inner parallel regions. That is fine from the point of view of the standard. The number of supported inner parallel levels (both active and inactive) is implementation defined.

We don't need to synchronize the whole program, we just need to synchronize accesses to the parallelLevel array. In SPMD mode with full runtime (or SPMD mode with atomic ops) the PTX compiler generates unsafe code for the parallelLevel array unless it is marked as volatile (i.e., a simple atomic-like construct in CUDA).

Nobody argues we should synchronize the whole program. If you want to have atomic accesses to the parallelLevel array, make the accesses to the parallelLevel array atomic.

You cannot just say that the PTX compiler generates unsafe code. That is not an argument, that is a guess.

Volatile provides the required level of atomicity per the PTX ISA.

Arguably, atomic accesses provide atomicity. You also never responded to the question of which other accesses actually race with the ones to parallelLevel, or why you do not synchronize the accesses to parallelLevel across the warp explicitly.

I answered all the questions. We can use explicit synchronization, yes, but it will slow down the whole program execution. Instead, it is more effective to synchronize the access to the single variable. The volatile modifier allows doing this without explicit atomic operations. In CUDA, volatile is slightly different from that in C/C++ and can be used to implement relaxed atomic operations.
In this array, the memory is written by one thread but can be read by other threads in the same warp. Without the volatile modifier or atomic ops, the ptxas tool reorders operations, and that leads to undefined behaviour when the runtime is inlined.
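
The one-writer/many-readers pattern described here can be modeled in host C++ to show what the explicit synchronization the reviewers ask about would look like: a release store on a flag orders the preceding plain write, so a reader that acquires the flag also sees the data. This is a sketch of the general technique, not the runtime's actual code:

```cpp
#include <atomic>
#include <cassert>

// One thread publishes a value; others poll for it. The release store on
// `ready` orders the plain write to `data` before it, and the acquire load
// in the reader orders the read of `data` after it, so a reader cannot
// observe ready == true without also observing data == 42.
int data = 0;
std::atomic<bool> ready{false};

void writer() {
  data = 42;                                    // plain, non-atomic write
  ready.store(true, std::memory_order_release); // publish
}

int reader() {
  while (!ready.load(std::memory_order_acquire)) {
    // spin until the writer publishes
  }
  return data;
}
```

In a real use the writer runs in another thread; the point is that the ordering is stated explicitly in the source rather than hoped for from volatile.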

hfinkel added subscribers: jlebar, Hahnfeld, tra and 2 others. (Edited) Jun 11 2019, 8:47 AM

The standard has "levels" and "active-levels". Both need to be tracked but it seems we only have a single array?

They are all tracked in this array.

Where is that described?

Sorry, read the code. I'm not a debugger.

First of all, this response is simply rude.
Second, you wrote the code so my question is well justified.
Third, does that mean you encode multiple values in a single int array but did not document this at all? The code is not documentation.
Finally, this will break once we have a parallel region in a recursive loop with more than 254 instantiations, correct?

First of all, you started this when you expressed the thought that the patches are accepted not because they are correct, but because we all work at the same company. That was rude! Besides, we don't work at the same company anymore.

I think it is important to note that, to the best of my knowledge, no one was implying any bad faith on the part of anyone. To name one specific factor: We have had problems in the past where groups of collaborators/coworkers have done off-list reviews, followed only by a perfunctory/superficial on-list review, resulting in a lack of public discussion around the relevant design and implementation considerations. The review history highlighted by Johannes can give that impression, and we all need the community to watch itself in this regard, because of many potentially-relevant factors, to ensure the quality of our public code reviews is high. I see no reason to believe that everyone here wants anything other than to create the best possible code base. Sometimes that means one member of the community pointing out that, in his or her opinion, past reviews might have been insufficiently considered, and we should welcome that, and all work together to address these kinds of concerns going forward.

Second, you need to try to understand the main idea yourself. I can't explain the whole protocol between the compiler and the runtime; it would take a lot of time. Try to understand it first and then ask questions about particular details, not the whole scheme.

Asking for additional high-level documentation on how some subsystem functions is not, on its face, unreasonable. This happens during code review frequently, and from my perspective, should probably happen even more frequently than it already does. If you feel that the relevant documentation already exists, then it's perfectly appropriate to point this out.

Or do you want me to ask you to explain the principles of the Attributor in full? Or the loop vectorizer? Or something else? Your question is not productive.
Third, the single array is used to handle up to 128 (not 256) inner parallel regions. That is fine from the point of view of the standard. The number of supported inner parallel levels (both active and inactive) is implementation defined.

This certainly seems like something that should be documented. The implementation can certainly have limits, but we should certainly know what they are explicitly whenever possible.

We don't need to synchronize the whole program, we just need to synchronize accesses to the parallelLevel array. In SPMD mode with full runtime (or SPMD mode with atomic ops) the PTX compiler generates unsafe code for the parallelLevel array unless it is marked as volatile (i.e., a simple atomic-like construct in CUDA).

Nobody argues we should synchronize the whole program. If you want to have atomic accesses to the parallelLevel array, make the accesses to the parallelLevel array atomic.

You cannot just say that the PTX compiler generates unsafe code. That is not an argument, that is a guess.

Volatile provides the required level of atomicity per the PTX ISA.

Arguably, atomic accesses provide atomicity. You also never responded to the question of which other accesses actually race with the ones to parallelLevel, or why you do not synchronize the accesses to parallelLevel across the warp explicitly.

I answered all the questions. We can use explicit synchronization, yes, but it will slow down the whole program execution. Instead, it is more effective to synchronize the access to the single variable. The volatile modifier allows doing this without explicit atomic operations. In CUDA, volatile is slightly different from that in C/C++ and can be used to implement relaxed atomic operations.
In this array, the memory is written by one thread but can be read by other threads in the same warp. Without the volatile modifier or atomic ops, the ptxas tool reorders operations, and that leads to undefined behaviour when the runtime is inlined.

This is a very interesting situation. First, I can certainly imagine that older versions of LLVM might have implicitly treated volatile in a way consistent with these semantics, but I have no idea whether Clang's CUDA implementation now supports these semantics explicitly. I'm not saying that we shouldn't use this feature of CUDA as a result, but this is important to understand. I certainly did not realize that CUDA provided special volatile semantics that include atomic properties, and if that's something that LLVM needs to support, we certainly need to have this explicitly documented.

@reames , @tra , @Hahnfeld , @jlebar , @chandlerc, I see that this was discussed in D50391 (in terms of using PTX's volatile to provide atomic lowering), but it's not clear to me that using volatile with atomic semantics in LLVM is something we guarantee will work (even in CUDA mode). I imagine that we'd need to lower these in terms of relaxed atomics in Clang for this to really be guaranteed to work correctly.

Regardless, please don't take offence to skepticism around using volatile to accomplish synchronization. I doubt that most people know about CUDA's particular properties in this regard, and as you know, in C/C++ the properties are much weaker. This is definitely something that requires some comments so that readers of the code understand what's going on.

The standard has "levels" and "active-levels". Both need to be tracked but it seems we only have a single array?

They are all tracked in this array.

Where is that described?

Sorry, read the code. I'm not a debugger.

First of all, this response is simply rude.
Second, you wrote the code so my question is well justified.
Third, does that mean you encode multiple values in a single int array but did not document this at all? The code is not documentation.
Finally, this will break once we have a parallel region in a recursive loop with more than 254 instantiations, correct?

First of all, you started this when you expressed the thought that the patches are accepted not because they are correct, but because we all work at the same company. That was rude! Besides, we don't work at the same company anymore.

I think it is important to note that, to the best of my knowledge, no one was implying any bad faith on the part of anyone. To name one specific factor: We have had problems in the past where groups of collaborators/coworkers have done off-list reviews, followed only by a perfunctory/superficial on-list review, resulting in a lack of public discussion around the relevant design and implementation considerations. The review history highlighted by Johannes can give that impression, and we all need the community to watch itself in this regard, because of many potentially-relevant factors, to ensure the quality of our public code reviews is high. I see no reason to believe that everyone here wants anything other than to create the best possible code base. Sometimes that means one member of the community pointing out that, in his or her opinion, past reviews might have been insufficiently considered, and we should welcome that, and all work together to address these kinds of concerns going forward.

Hal, I agree that it was a very dangerous situation, but I think you need to prove something before trying to escalate the situation and accuse somebody of doing incorrect things. Johannes did this without any proof; he directly accused me and others of doing improper things, though he has no idea at all how things work in CUDA.

Second, you need to try to understand the main idea yourself. I can't explain the whole protocol between the compiler and the runtime; it would take a lot of time. Try to understand it first and then ask questions about particular details, not the whole scheme.

Asking for additional high-level documentation on how some subsystem functions is not, on its face, unreasonable. This happens during code review frequently, and from my perspective, should probably happen even more frequently than it already does. If you feel that the relevant documentation already exists, then it's perfectly appropriate to point this out.

Yes, but you need to have at least some basic knowledge about OpenMP itself, CUDA, and the OpenMP runtime. I have no time to write lectures trying to explain basic things to Johannes.

Or do you want me to ask you to explain the principles of the Attributor in full? Or the loop vectorizer? Or something else? Your question is not productive.
Third, the single array is used to handle up to 128 (not 256) inner parallel regions. That is fine from the point of view of the standard. The number of supported inner parallel levels (both active and inactive) is implementation defined.

This certainly seems like something that should be documented. The implementation can certainly have limits, but we should certainly know what they are explicitly whenever possible.

Yes, you're right. The design decisions must be described somewhere. But I don't think that should prevent work on improving the runtime.

We don't need to synchronize the whole program, we just need to synchronize accesses to the parallelLevel array. In SPMD mode with full runtime (or SPMD mode with atomic ops) the PTX compiler generates unsafe code for the parallelLevel array unless it is marked as volatile (i.e., a simple atomic-like construct in CUDA).

Nobody argues we should synchronize the whole program. If you want to have atomic accesses to the parallelLevel array, make the accesses to the parallelLevel array atomic.

You cannot just say that the PTX compiler generates unsafe code. That is not an argument, that is a guess.

Volatile provides the required level of atomicity per the PTX ISA.

Arguably, atomic accesses provide atomicity. You also never responded to the question of which other accesses actually race with the ones to parallelLevel, or why you do not synchronize the accesses to parallelLevel across the warp explicitly.

I answered all the questions. We can use explicit synchronization, yes, but it will slow down the whole program execution. Instead, it is more effective to synchronize the access to the single variable. The volatile modifier allows doing this without explicit atomic operations. In CUDA, volatile is slightly different from that in C/C++ and can be used to implement relaxed atomic operations.
In this array, the memory is written by one thread but can be read by other threads in the same warp. Without the volatile modifier or atomic ops, the ptxas tool reorders operations, and that leads to undefined behaviour when the runtime is inlined.

This is a very interesting situation. First, I can certainly imagine that older versions of LLVM might have implicitly treated volatile in a way consistent with these semantics, but I have no idea whether Clang's CUDA implementation now supports these semantics explicitly. I'm not saying that we shouldn't use this feature of CUDA as a result, but this is important to understand. I certainly did not realize that CUDA provided special volatile semantics that include atomic properties, and if that's something that LLVM needs to support, we certainly need to have this explicitly documented.

@reames , @tra , @Hahnfeld , @jlebar , @chandlerc, I see that this was discussed in D50391 (in terms of using PTX's volatile to provide atomic lowering), but it's not clear to me that using volatile with atomic semantics in LLVM is something we guarantee will work (even in CUDA mode). I imagine that we'd need to lower these in terms of relaxed atomics in Clang for this to really be guaranteed to work correctly.

Regardless, please don't take offence to skepticism around using volatile to accomplish synchronization. I doubt that most people know about CUDA's particular properties in this regard, and as you know, in C/C++ the properties are much weaker. This is definitely something that requires some comments so that readers of the code understand what's going on.

If somebody tries to review the code, the review must be productive. If the review turns into a set of lectures about language details, it is not productive; it is school lecturing. Should I explain the CUDA language in the review? No. To review the code you need at least some minimal level of expertise.

jfb added a comment. Jun 11 2019, 9:50 AM

I think it is important to note that, to the best of my knowledge, no one was implying any bad faith on the part of anyone. To name one specific factor: We have had problems in the past where groups of collaborators/coworkers have done off-list reviews, followed only by a perfunctory/superficial on-list review, resulting in a lack of public discussion around the relevant design and implementation considerations. The review history highlighted by Johannes can give that impression, and we all need the community to watch itself in this regard, because of many potentially-relevant factors, to ensure the quality of our public code reviews is high. I see no reason to believe that everyone here wants anything other than to create the best possible code base. Sometimes that means one member of the community pointing out that, in his or her opinion, past reviews might have been insufficiently considered, and we should welcome that, and all work together to address these kinds of concerns going forward.

Hal, I agree that it was a very dangerous situation, but I think you need to prove something before trying to escalate the situation and accuse somebody of doing incorrect things. Johannes did this without any proof; he directly accused me and others of doing improper things, though he has no idea at all how things work in CUDA.

Alexey: Hal is trying to turn this review into a productive one. Please internalize his guidance, and come back ready to answer feedback which you'd rather not answer. I don't see this review going anywhere without you changing your approach. You're frustrated by having to explain what you consider to be basics, and that's fine. Say so politely, and point at relevant resources. Reviews don't *need* to be a school lecture, but an open-source project is healthy when everyone learns. First it forces the committer to actually understand what they're doing (explaining basics is a great learning tool), and second it makes the code easier to maintain if the original committer stops participating.

In D62393#1538350, @jfb wrote:

I think it is important to note that, to the best of my knowledge, no one was implying any bad faith on the part of anyone. To name one specific factor: We have had problems in the past where groups of collaborators/coworkers have done off-list reviews, followed only by a perfunctory/superficial on-list review, resulting in a lack of public discussion around the relevant design and implementation considerations. The review history highlighted by Johannes can give that impression, and we all need the community to watch itself in this regard, because of many potentially-relevant factors, to ensure the quality of our public code reviews is high. I see no reason to believe that everyone here wants anything other than to create the best possible code base. Sometimes that means one member of the community pointing out that, in his or her opinion, past reviews might have been insufficiently considered, and we should welcome that, and all work together to address these kinds of concerns going forward.

Hal, I agree that it was a very dangerous situation, but I think you need to prove something before trying to escalate the situation and accuse somebody of doing incorrect things. Johannes did this without any proof; he directly accused me and others of doing improper things, though he has no idea at all how things work in CUDA.

Alexey: Hal is trying to turn this review into a productive one. Please internalize his guidance, and come back ready to answer feedback which you'd rather not answer. I don't see this review going anywhere without you changing your approach. You're frustrated by having to explain what you consider to be basics, and that's fine. Say so politely, and point at relevant resources. Reviews don't *need* to be a school lecture, but an open-source project is healthy when everyone learns. First it forces the committer to actually understand what they're doing (explaining basics is a great learning tool), and second it makes the code easier to maintain if the original committer stops participating.

I have already explained everything several times. I said explicitly that in CUDA volatile is a little bit different than in C/C++. I explained that volatile operations have the same semantics as relaxed atomic operations per the PTX ISA. What else could I say? To explain how the parallelLevel counter works I would need to explain how the warp works. The reviewer needs to know this himself. After he gets this basic knowledge, the main idea will be clear. If not, I can help him understand it. The code of the runtime is not a kind of rocket science; actually, it is very simple. But you need to spend some time on it yourself.

I know some things about CUDA, volatile and C++. Let's see if I can understand the part of the proposed change that involves these. I don't understand the part about the test but I don't need to, I'll ignore that.

The gist of the issue is that parallelLevel table should really be atomic<> because of data-races on it (that was a bug prior to this, maybe there are more such bugs lying around), except there's a problem with the obvious fix: atomic<> is not available in CUDA (yet) so it's not an option to fix this issue. The next best thing we have instead is volatile.

Now, volatile in PTX (e.g. asm("ld.volatile...[x];")) and volatile (e.g. "volatile ...x;") in C++ source like this, are not exactly the same thing. When CUDA says that volatile is equivalent to memory_order_relaxed, it's saying something clear (I think) about the PTX level code but it's still being pretty vague about CUDA C++ level code. OTOH it's entirely possible for Clang to do something with either atomic<> or volatile that isn't valid for the other -- and that's a different can of worms than, say, NVCC which does {whatever NVCC does, it's not the compiler you're using}.

However, since people do use CUDA C++ volatile this way a lot already, you can't really have a good CUDA toolchain unless it is the case (including, by accident) that it works for this purpose. In other words, it's probably reasonable to assume this in your code because every other code would be on fire otherwise, and it's not on fire, so far as we can tell.

It might be worth it to prepare your code for the eventual arrival of atomic<> on CUDA. That is, maybe create a template alias on T with some helper functions, just enough for your use. It might make this code more self-documenting and make it easy to make it 100% legit later on.

Hi Olivier, thanks for your comments. You're absolutely right. Actually, we're using both compilers, nvcc and clang (under different conditions, though). Marking the variable volatile does not break it at the LLVM level. Maybe it is by accident, but I rather doubt that.
Do you suggest creating a template function that provides access to the parallelLevel variable? And when atomic<> is supported by CUDA, change the type of this variable to atomic<> so the compiler could automatically instantiate this template function with the proper type, right? Or do you have something different in mind? If so, could you provide a small example of your idea?

That's better than what I had in mind, so it sounds like a good step to me. (I don't know what's the norm here, but maybe with a comment and an issue to fix it later.) Cheers.

Ok, I can do this, thanks for the idea.

Hi Olivier, thanks for your comments.

+1

You're absolutely right. Actually, we're using both compilers, nvcc and clang (under different conditions, though). Marking the variable volatile does not break it at the LLVM level. Maybe it is by accident, but I rather doubt that.

I'm pretty sure that it's not accidental, but I'm very concerned about relying on it working without documenting the required semantics. Two things:

  1. First, we currently provide volatile accesses with some synchronization semantics. For example, for LoadInst, we have this:
bool isUnordered() const {
  return (getOrdering() == AtomicOrdering::NotAtomic ||
          getOrdering() == AtomicOrdering::Unordered) &&
         !isVolatile();
}

while, at the same time, documenting that volatile has no such semantics at the IR level. The LangRef currently says, "The optimizers may change the order of volatile operations relative to non-volatile operations. This is not Java's “volatile” and has no cross-thread synchronization behavior."

And, thus, my concern. It works, and there is some explicit code to support this functionality, but the semantics are not documented (and, perhaps, are documented *not* to work). This, I think, we should correct to ensure correct functioning going forward. Alternatively, we might change how Clang lowers volatile accesses in CUDA mode to make them relaxed atomics? Any thoughts on this?

Do you suggest creating a template function that provides access to the parallelLevel variable? And when atomic<> is supported by CUDA, change the type of this variable to atomic<> so the compiler could automatically instantiate this template function with the proper type, right? Or do you have something different in mind? If so, could you provide a small example of your idea?

Yes, probably it is a good idea to fix possible incompatibility with nvcc. But this problem may be not very significant for CUDA if some possibly dangerous optimization passes are disabled for the CUDA toolchain. But anyway, it is better to translate it directly to relaxed atomic operations, just in case.

jfb added a comment. Jun 11 2019, 3:29 PM

FWIW we already support -fms-volatile, so there's precedent if you wanted -fnv-volatile (however terrible that is).

In D62393#1539012, @jfb wrote:

FWIW we already support -fms-volatile, so there's precedent if you wanted -fnv-volatile (however terrible that is).

Most probably, we don't need this option, clang should emit correct code for volatile vars in Cuda mode automatically.

+1

That having been said, maybe this is a very simple change because it is the same as (or very similar to) what MSVolatile does?

/// An LValue is a candidate for having its loads and stores be made atomic if
/// we are operating under /volatile:ms *and* the LValue itself is volatile and
/// performing such an operation can be performed without a libcall.
bool CodeGenFunction::LValueIsSuitableForInlineAtomic(LValue LV) {
  if (!CGM.getCodeGenOpts().MSVolatile) return false;
  AtomicInfo AI(*this, LV);
  bool IsVolatile = LV.isVolatile() || hasVolatileMember(LV.getType());
  // An atomic is inline if we don't need to use a libcall.
  bool AtomicIsInline = !AI.shouldUseLibcall();
  // MSVC doesn't seem to do this for types wider than a pointer.
  if (getContext().getTypeSize(LV.getType()) >
      getContext().getTypeSize(getContext().getIntPtrType()))
    return false;
  return IsVolatile && AtomicIsInline;
}

Making this return true when LangOpts.CUDAIsDevice is set might essentially do what we need?

Better to ask @tra or @jlebar. Maybe, clang already does something similar for Cuda?

jfb added a comment. Jun 11 2019, 4:03 PM

My point exactly: MS volatile promotes volatile to seq_cst, it's trivial to copy and promote to relaxed and isn't too objectionable because we've got precedent.

Further consideration: you say you want to match the nvcc semantics... but it sounds like it doesn't actually have great / documented semantics, and they might change. Should this volatile->relaxed be opt-in or opt-out? Is that the only thing we'd want to guarantee?

I think for now we can support this unconditionally. Later, if the semantics change, we can check for the CUDA version and disable this behavior for newer versions. It seems to me volatile->relaxed should be enough.

This is a clarification of some older comments.

First of all, you started this when you expressed the thought that the patches are accepted not because they are correct, but because we all work in the same company. This was rude! Besides, we don't work at the same company anymore.

I do not remember saying this, especially since I know you don't work at the same company.

Second, you need to try to understand the main idea yourself. I can't explain the whole protocol between the compiler and the runtime; it would take a lot of time. Try to understand this first and then ask questions about the particular details, but not about the whole scheme. Or do you want me to ask you to explain the principles of the Attributor in full? Or the loop vectorizer? Or something else? Your question is not productive.

That is exactly what happened (wrt. the Attributor): people looked at the code and asked about principles, requested documentation, etc. If you look at the code now, it is all the better for it. So far you have just ignored my requests for clarifications and justifications, which is what this whole code review actually started with.

Third, the single array is used to handle up to 128 (not 256) inner parallel regions. It is fine from the point of view of the standard. The number of supported inner parallel levels (both active and inactive) is implementation-defined.

I asked where we document that a single array encodes both the number of active and inactive parallel regions at the same time. The code is not sufficient documentation for such a low-level implementation detail. Similar, but previously not on my radar, is the fact that there is an apparently undocumented implementation detail wrt the number of levels we can handle.

Hal, I agree that these were very dangerous situations, but I think you need to prove something before trying to escalate the situation and blaming somebody for doing incorrect things. Johannes did this without any proof; he directly blamed me and others for doing improper things, though he has no idea at all how things work in CUDA.

I already provided plenty of arguments in https://reviews.llvm.org/D62199#1515182 and https://reviews.llvm.org/D62199#1517073.

For the record, this code review was (from my side) never about accusations or improper behavior but about:

  1. The right solution to a problem I don't think we specified(*) properly, and
  2. The general lack of documentation that led to various questions on how to interpret the changes.

(*) My question about which accesses are racy has been ignored, as have my inquiries about an alternative explicit synchronization to communicate the value through all threads in the warp.

I don't think I have "no idea at all how things work in Cuda" and I don't appreciate the accusation.

Yes, but you need to have at least some basic knowledge about OpenMP itself, CUDA, and the OpenMP runtime. I have no time to write lectures trying to explain some basic things to Johannes.

I, and probably other people, might argue that I have some OpenMP/Clang experience.

Yes, you're right. The design decisions must be described somewhere. But I don't think that this must prevent the work on the runtime improvement.

This seems to me like a core issue. Improvements without documentation will inevitably cause problems down the road, and rushing into a solution is not a sustainable practice, either.

Arguably, atomic accesses provide atomicity. You also never responded to the question of which other accesses actually race with the ones to parallelLevel, or why you do not synchronize the accesses to parallelLevel across the warp explicitly.

I answered all the questions. We can use explicit synchronization, yes, but it will slow down the whole program execution. Instead, it is more effective to synchronize access to the single variable. The volatile modifier allows doing this without explicit atomic operations. In CUDA, volatile is slightly different from that in C/C++ and can be used to implement relaxed atomic operations.
In this array, the memory is written by one thread but can be read by other threads in the same warp. Without the volatile modifier or atomic ops, the ptxas tool reorders operations, and this leads to undefined behavior when the runtime is inlined.

According to D50391, explicit relaxed accesses are lowered into volatile PTX accesses. Why did you see a slowdown, or did you simply expect one?

If somebody tries to review the code, the review must be productive. If the review leads to a set of lectures about language details, it is not productive; those are school lectures. Should I explain the CUDA language in the review? No. To review the code you need to have at least some minimal level of expertise.

I never asked for a school lecture, nor is it appropriate that you think you have to give me one instead of actually answering my questions and commenting on my concerns. Simply questioning my expertise and thereby ignoring my concerns is not helping anyone.

Hal, @hfinkel, do you really want me to continue with all this stuff?

Trying to catch up on all the comments, replies to some:

I would have assumed omp_in_parallel to return true if the parallelLevel is not the default (0 or 1). Why isn't that so?

According to the standard, it should return 1 only if we are in an active parallel region. An active parallel region is a region with a number of threads greater than 1.

I don't agree with this statement and its implication that we need to track both levels and active-levels as I already argued in D61380: In my understanding of the standard, omp_in_parallel should return true iff it's nested somewhere in an active parallel region.

@reames , @tra , @Hahnfeld , @jlebar , @chandlerc, I see that this was discussed in D50391 (in terms of using PTX's volatile to provide atomic lowering), but it's not clear to me that using volatile with atomic semantics in LLVM is something we guarantee will work (even in CUDA mode). I imagine that we'd need to lower these in terms of relaxed atomics in Clang for this to really be guaranteed to work correctly.

That patch was about lowering atomics in LLVM IR to PTX's volatile which is clearly documented to work. This has nothing to do with the volatile modifier in C/C++ which does not get translated into atomics in LLVM IR by default AFAIK (at least not without switches for compatibility with MS). I'm not sure about the semantics in CUDA and it may not be documented at all (as others have noted). I'd probably agree with what @__simt__ wrote about this.

Trying to catch up on all the comments, replies to some:

I would have assumed omp_in_parallel to return true if the parallelLevel is not the default (0 or 1). Why isn't that so?

According to the standard, it should return 1 only if we are in an active parallel region. An active parallel region is a region with a number of threads greater than 1.

I don't agree with this statement and its implication that we need to track both levels and active-levels as I already argued in D61380: In my understanding of the standard, omp_in_parallel should return true iff it's nested somewhere in an active parallel region.

Maybe I did not express myself clearly, but this is exactly how it works currently. I agreed with you and implemented it just as you said: 1 is returned if at least one parallel region is active.
And we count both active and inactive regions. We may have only one active region, the very first one. If the first region is active, the MSB is set to 1 to mark this situation.

@reames , @tra , @Hahnfeld , @jlebar , @chandlerc, I see that this was discussed in D50391 (in terms of using PTX's volatile to provide atomic lowering), but it's not clear to me that using volatile with atomic semantics in LLVM is something we guarantee will work (even in CUDA mode). I imagine that we'd need to lower these in terms of relaxed atomics in Clang for this to really be guaranteed to work correctly.

That patch was about lowering atomics in LLVM IR to PTX's volatile which is clearly documented to work. This has nothing to do with the volatile modifier in C/C++ which does not get translated into atomics in LLVM IR by default AFAIK (at least not without switches for compatibility with MS). I'm not sure about the semantics in CUDA and it may not be documented at all (as others have noted). I'd probably agree with what @__simt__ wrote about this.

I also agree that clang in Cuda mode should translate it to relaxed atomic, if it does not already. Otherwise, it means the compiler incorrectly handles volatile from the CUDA point of view.

tra added a comment. Jun 13 2019, 9:14 AM

@reames , @tra , @Hahnfeld , @jlebar , @chandlerc, I see that this was discussed in D50391 (in terms of using PTX's volatile to provide atomic lowering), but it's not clear to me that using volatile with atomic semantics in LLVM is something we guarantee will work (even in CUDA mode). I imagine that we'd need to lower these in terms of relaxed atomics in Clang for this to really be guaranteed to work correctly.

Do I understand it correctly that the argument is about using C++ volatile with the assumption that those will map to ld.volatile/st.volatile, which happen to provide sufficiently strong guarantees to be equivalent of LLVM's atomic monotonic operations?

If that's the case, then I would agree that it's an implementation detail one should not be relying on. If atomicity is needed, we should figure out a way to express that in C++.

In practice, the standard C++ library support in cuda is somewhat limited. I don't think atomic<> works, so volatile might be a 'happens to work' short-term workaround. If that's the case, then there should be a comment describing what's going on here and a TODO to fix it when better support for atomics on GPU is available.

Another option would be to use atomic*() functions provided by CUDA, but those do have limited functionality on older GPUs.

Yet another alternative is to explicitly use ld.volatile/st.volatile from inline assembly. I don't think we have it plumbed as clang builtin.

In D62393#1541969, @tra wrote:

@reames , @tra , @Hahnfeld , @jlebar , @chandlerc, I see that this was discussed in D50391 (in terms of using PTX's volatile to provide atomic lowering), but it's not clear to me that using volatile with atomic semantics in LLVM is something we guarantee will work (even in CUDA mode). I imagine that we'd need to lower these in terms of relaxed atomics in Clang for this to really be guaranteed to work correctly.

Do I understand it correctly that the argument is about using C++ volatile with the assumption that those will map to ld.volatile/st.volatile, which happen to provide sufficiently strong guarantees to be equivalent of LLVM's atomic monotonic operations?

If that's the case, then I would agree that it's an implementation detail one should not be relying on. If atomicity is needed, we should figure out a way to express that in C++.

In practice, the standard C++ library support in cuda is somewhat limited. I don't think atomic<> works, so volatile might be a 'happens to work' short-term workaround. If that's the case, then there should be a comment describing what's going on here and a TODO to fix it when better support for atomics on GPU is available.

Another option would be to use atomic*() functions provided by CUDA, but those do have limited functionality on older GPUs.

Yet another alternative is to explicitly use ld.volatile/st.volatile from inline assembly. I don't think we have it plumbed as clang builtin.

Artem, thanks for your comment. But we need not only atomicity, but also to command the ptxas compiler not to optimize accesses to parallelLevel. According to several comments (like this one https://stackoverflow.com/a/1533115), it is recommended to use the volatile modifier in Cuda in such cases. If clang does not provide the required level of support for Cuda volatile, I think this is an incompatibility with nvcc.
Also, I already thought about using PTX instructions directly. Probably, you're right that it would be better to use them if you're sure that there is a difference between nvcc and clang.

In D62393#1541969, @tra wrote:

@reames , @tra , @Hahnfeld , @jlebar , @chandlerc, I see that this was discussed in D50391 (in terms of using PTX's volatile to provide atomic lowering), but it's not clear to me that using volatile with atomic semantics in LLVM is something we guarantee will work (even in CUDA mode). I imagine that we'd need to lower these in terms of relaxed atomics in Clang for this to really be guaranteed to work correctly.

Do I understand it correctly that the argument is about using C++ volatile with the assumption that those will map to ld.volatile/st.volatile, which happen to provide sufficiently strong guarantees to be equivalent of LLVM's atomic monotonic operations?

If that's the case, then I would agree that it's an implementation detail one should not be relying on. If atomicity is needed, we should figure out a way to express that in C++.

In practice, the standard C++ library support in cuda is somewhat limited. I don't think atomic<> works, so volatile might be a 'happens to work' short-term workaround. If that's the case, then there should be a comment describing what's going on here and a TODO to fix it when better support for atomics on GPU is available.

Another option would be to use atomic*() functions provided by CUDA, but those do have limited functionality on older GPUs.

Yet another alternative is to explicitly use ld.volatile/st.volatile from inline assembly. I don't think we have it plumbed as clang builtin.

Artem, thanks for your comment. But we need not only atomicity, but also to command the ptxas compiler not to optimize accesses to parallelLevel. According to several comments (like this one https://stackoverflow.com/a/1533115), it is recommended to use the volatile modifier in Cuda in such cases. If clang does not provide the required level of support for Cuda volatile, I think this is an incompatibility with nvcc.
Also, I already thought about using PTX instructions directly. Probably, you're right that it would be better to use them if you're sure that there is a difference between nvcc and clang.

The quoted stack overflow comment doesn't seem like the one you intended. Seems like a mispaste.

When you say "we need to command the ptxas compiler to not optimize accesses to parallelLevel", this is not the best way to describe what you actually need here. I think you're channeling the belief that compilers don't optimize accesses to std::atomic<T>, and explaining what you need in terms of that belief. It seems that any optimization that could be performed (and some are indeed performed!) on std::atomic<T> with memory_order_relaxed is valid to perform on the accesses to parallelLevel as well, because that is really what it should be.

(If that is not the case, then now would be a good time to say it.)

The next conclusion is completely correct: you get that very semantic from the ptxas compiler by using either *.relaxed.sys or *.volatile accesses. Both have the same meaning as accesses std::atomic<T> with memory_order_relaxed in modern PTX, whilst nothing in modern PTX carries a true "no optimizations" meaning.

Lastly, I can confirm that Clang definitely supports inline PTX asm, I've used it, with these instructions specifically.

In D62393#1541969, @tra wrote:

@reames , @tra , @Hahnfeld , @jlebar , @chandlerc, I see that this was discussed in D50391 (in terms of using PTX's volatile to provide atomic lowering), but it's not clear to me that using volatile with atomic semantics in LLVM is something we guarantee will work (even in CUDA mode). I imagine that we'd need to lower these in terms of relaxed atomics in Clang for this to really be guaranteed to work correctly.

Do I understand it correctly that the argument is about using C++ volatile with the assumption that those will map to ld.volatile/st.volatile, which happen to provide sufficiently strong guarantees to be equivalent of LLVM's atomic monotonic operations?

If that's the case, then I would agree that it's an implementation detail one should not be relying on. If atomicity is needed, we should figure out a way to express that in C++.

In practice, the standard C++ library support in cuda is somewhat limited. I don't think atomic<> works, so volatile might be a 'happens to work' short-term workaround. If that's the case, then there should be a comment describing what's going on here and a TODO to fix it when better support for atomics on GPU is available.

Another option would be to use atomic*() functions provided by CUDA, but those do have limited functionality on older GPUs.

Yet another alternative is to explicitly use ld.volatile/st.volatile from inline assembly. I don't think we have it plumbed as clang builtin.

Artem, thanks for your comment. But we need not only atomicity, but also to command the ptxas compiler not to optimize accesses to parallelLevel. According to several comments (like this one https://stackoverflow.com/a/1533115), it is recommended to use the volatile modifier in Cuda in such cases. If clang does not provide the required level of support for Cuda volatile, I think this is an incompatibility with nvcc.
Also, I already thought about using PTX instructions directly. Probably, you're right that it would be better to use them if you're sure that there is a difference between nvcc and clang.

The quoted stack overflow comment doesn't seem like the one you intended. Seems like a mispaste.

When you say "we need to command the ptxas compiler to not optimize accesses to parallelLevel", this is not the best way to describe what you actually need here. I think you're channeling the belief that compilers don't optimize accesses to std::atomic<T>, and explaining what you need in terms of that belief. It seems that any optimization that could be performed (and some are indeed performed!) on std::atomic<T> with memory_order_relaxed is valid to perform on the accesses to parallelLevel as well, because that is really what it should be.

(If that is not the case, then now would be a good time to say it.)

The next conclusion is completely correct: you get that very semantic from the ptxas compiler by using either *.relaxed.sys or *.volatile accesses. Both have the same meaning as accesses std::atomic<T> with memory_order_relaxed in modern PTX, whilst nothing in modern PTX carries a true "no optimizations" meaning.

Lastly, I can confirm that Clang definitely supports inline PTX asm, I've used it, with these instructions specifically.

Yes, sorry, wrong link. This is the right one https://stackoverflow.com/questions/15331009/when-to-use-volatile-with-shared-cuda-memory
No, I'm not relying on the non-optimization of atomics. I need both volatile semantics and atomicity, so the compiler could not optimize out memory accesses and the accesses would be atomic.

In D62393#1541969, @tra wrote:

@reames , @tra , @Hahnfeld , @jlebar , @chandlerc, I see that this was discussed in D50391 (in terms of using PTX's volatile to provide atomic lowering), but it's not clear to me that using volatile with atomic semantics in LLVM is something we guarantee will work (even in CUDA mode). I imagine that we'd need to lower these in terms of relaxed atomics in Clang for this to really be guaranteed to work correctly.

Do I understand it correctly that the argument is about using C++ volatile with the assumption that those will map to ld.volatile/st.volatile, which happen to provide sufficiently strong guarantees to be equivalent of LLVM's atomic monotonic operations?

If that's the case, then I would agree that it's an implementation detail one should not be relying on. If atomicity is needed, we should figure out a way to express that in C++.

In practice, the standard C++ library support in cuda is somewhat limited. I don't think atomic<> works, so volatile might be a 'happens to work' short-term workaround. If that's the case, then there should be a comment describing what's going on here and a TODO to fix it when better support for atomics on GPU is available.

Another option would be to use atomic*() functions provided by CUDA, but those do have limited functionality on older GPUs.

Yet another alternative is to explicitly use ld.volatile/st.volatile from inline assembly. I don't think we have it plumbed as clang builtin.

Artem, thanks for your comment. But we need not only atomicity, but also to command the ptxas compiler not to optimize accesses to parallelLevel. According to several comments (like this one https://stackoverflow.com/a/1533115), it is recommended to use the volatile modifier in Cuda in such cases. If clang does not provide the required level of support for Cuda volatile, I think this is an incompatibility with nvcc.
Also, I already thought about using PTX instructions directly. Probably, you're right that it would be better to use them if you're sure that there is a difference between nvcc and clang.

The quoted stack overflow comment doesn't seem like the one you intended. Seems like a mispaste.

When you say "we need to command the ptxas compiler to not optimize accesses to parallelLevel", this is not the best way to describe what you actually need here. I think you're channeling the belief that compilers don't optimize accesses to std::atomic<T>, and explaining what you need in terms of that belief. It seems that any optimization that could be performed (and some are indeed performed!) on std::atomic<T> with memory_order_relaxed is valid to perform on the accesses to parallelLevel as well, because that is really what it should be.

(If that is not the case, then now would be a good time to say it.)

The next conclusion is completely correct: you get that very semantic from the ptxas compiler by using either *.relaxed.sys or *.volatile accesses. Both have the same meaning as accesses std::atomic<T> with memory_order_relaxed in modern PTX, whilst nothing in modern PTX carries a true "no optimizations" meaning.

Lastly, I can confirm that Clang definitely supports inline PTX asm, I've used it, with these instructions specifically.

Yes, I know that clang supports inline PTX asm. Probably, I can use them directly to get what I need with both nvcc and clang for sure.

tra added a comment. Jun 13 2019, 10:06 AM
In D62393#1541969, @tra wrote:

@reames , @tra , @Hahnfeld , @jlebar , @chandlerc, I see that this was discussed in D50391 (in terms of using PTX's volatile to provide atomic lowering), but it's not clear to me that using volatile with atomic semantics in LLVM is something we guarantee will work (even in CUDA mode). I imagine that we'd need to lower these in terms of relaxed atomics in Clang for this to really be guaranteed to work correctly.

Do I understand it correctly that the argument is about using C++ volatile with the assumption that those will map to ld.volatile/st.volatile, which happen to provide sufficiently strong guarantees to be equivalent of LLVM's atomic monotonic operations?

If that's the case, then I would agree that it's an implementation detail one should not be relying on. If atomicity is needed, we should figure out a way to express that in C++.

In practice, the standard C++ library support in cuda is somewhat limited. I don't think atomic<> works, so volatile might be a 'happens to work' short-term workaround. If that's the case, then there should be a comment describing what's going on here and a TODO to fix it when better support for atomics on GPU is available.

Another option would be to use atomic*() functions provided by CUDA, but those do have limited functionality on older GPUs.

Yet another alternative is to explicitly use ld.volatile/st.volatile from inline assembly. I don't think we have it plumbed as clang builtin.

Artem, thanks for your comment. But we need not only atomicity, but also to command the ptxas compiler not to optimize accesses to parallelLevel.

So you need to have ld.volatile/st.volatile to make sure the compiler does not mess with accesses to shared memory.
C++ volatile will give you that. You also rely on atomicity. C++ volatile does not guarantee that, even if NVPTX does happen to. It's a mere coincidence. What if the next PTX revision provides a better way to implement C++ volatile without providing atomicity? Then your code will all of a sudden break in new and exciting ways.

I guess the conservative approach here would be to declare the variable volatile, and then use inline asm ld.volatile/st.volatile to increment it. This will guarantee that the code does exactly what you want without assuming that LLVM back-end always lowers volatile to ld/st.volatile.

According to several comments, (like this one https://stackoverflow.com/a/1533115) it is recommended to use volatile modifier in Cuda in such cases. If clang does not provide the required level of support for Cuda volatile, I think this is an incompatibility with nvcc.

Wrong SO link? The question does not seem to be CUDA-related.

As for the advice to use volatile with nvcc, in my experience it tends to be overused a lot and it's quite often wrong. We used to have *tons* of that in CUDA code in TensorFlow and most of those cases turned out to be a big crowbar shoved in the way of compiler optimizations in order to cover up a real issue somewhere else.

Also, I already thought about using PTX instructions directly. Probably, you're right that it would be better to use them if you're sure that there is a difference between nvcc and clang.

AFAICT, clang will produce PTX .volatile for C++ volatile variables. That's the closest we can get to implement C++ semantics on GPU. The fact that it may end up providing stronger guarantees should not be relied on, IMO. It will not be obvious to whoever is going to read the code later. And LLVM back-end is not guaranteed to always lower C++ volatile as PTX .volatile.

No, I'm not relying on the non-optimization of atomics. I need both volatile semantics and atomicity, so the compiler could not optimize out memory accesses and the accesses would be atomic.

I'm going to try to improve the precision on that statement, if you'll allow me.

Since the address at &parallelLevel is necessarily backed by real memory (I mean RAM, and not PCI registers) because it is an instance introduced by C++ code, then many optimizations are not observable regardless of whatever decorations are applied to the declaration. For example x = *ptr; y = *ptr; can always be simplified to x = *ptr; y = x; when you know that ptr points to memory (and some other conditions, like the set of all y = ... simplified is finite). There is a long list of such simplifications based on "run in finite isolation" substitutions (*ptr = x; y = *ptr; => *ptr = x; y = x;... etc...).

To be clear: there is nothing you can do to prevent ptxas from performing optimizations such as these, even on accesses marked volatile, so the value of the predicate "could not optimize memory accesses" is just hardcoded to false, full stop.

The specific optimizations you don't want, are the optimizations that aren't valid on std::atomic<T> with memory_order_relaxed, for example performing substitutions across fences, barriers, and in potentially infinite loops. And again, the final conclusion is completely valid, you disable those optimizations with *.relaxed.sys or *.volatile in PTX. You just don't disable all optimizations.

In the end the mental model is simple: think as though you had std::atomic<T> available to you, the conclusions you draw and the code you write with that will be correct.

In D62393#1542042, @tra wrote:
In D62393#1541969, @tra wrote:

@reames , @tra , @Hahnfeld , @jlebar , @chandlerc, I see that this was discussed in D50391 (in terms of using PTX's volatile to provide atomic lowering), but it's not clear to me that using volatile with atomic semantics in LLVM is something we guarantee will work (even in CUDA mode). I imagine that we'd need to lower these in terms of relaxed atomics in Clang for this to really be guaranteed to work correctly.

Do I understand it correctly that the argument is about using C++ volatile with the assumption that those will map to ld.volatile/st.volatile, which happen to provide sufficiently strong guarantees to be equivalent of LLVM's atomic monotonic operations?

If that's the case, then I would agree that it's an implementation detail one should not be relying on. If atomicity is needed, we should figure out a way to express that in C++.

In practice, the standard C++ library support in cuda is somewhat limited. I don't think atomic<> works, so volatile might be a 'happens to work' short-term workaround. If that's the case, then there should be a comment describing what's going on here and a TODO to fix it when better support for atomics on GPU is available.

Another option would be to use atomic*() functions provided by CUDA, but those do have limited functionality on older GPUs.

Yet another alternative is to explicitly use ld.volatile/st.volatile from inline assembly. I don't think we have it plumbed as clang builtin.

Artem, thanks for your comment. But we need not only atomicity, but also to command the ptxas compiler not to optimize accesses to parallelLevel.

So you need to have ld.volatile/st.volatile to make sure the compiler does not mess with accesses to shared memory.
C++ volatile will give you that. You also rely on atomicity. C++ volatile does not guarantee that, even if NVPTX does happen to. It's a mere coincidence. What if the next PTX revision provides a better way to implement C++ volatile without providing atomicity? Then your code will all of a sudden break in new and exciting ways.

I guess the conservative approach here would be to declare the variable volatile, and then use inline asm ld.volatile/st.volatile to increment it. This will guarantee that the code does exactly what you want without assuming that LLVM back-end always lowers volatile to ld/st.volatile.

Yes, this is what I'm going to do.

According to several comments, (like this one https://stackoverflow.com/a/1533115) it is recommended to use volatile modifier in Cuda in such cases. If clang does not provide the required level of support for Cuda volatile, I think this is an incompatibility with nvcc.

Wrong SO link? The question does not seem to be CUDA-related.

As for the advice to use volatile with nvcc, in my experience it tends to be overused a lot and it's quite often wrong. We used to have *tons* of that in CUDA code in TensorFlow and most of those cases turned out to be a big crowbar shoved in the way of compiler optimizations in order to cover up a real issue somewhere else.

I know; I also fixed a couple of bugs masked with volatile, but not in this particular case.

Also, I already thought about using PTX instructions directly. Probably, you're right that it would be better to use them if you're sure that there is a difference between nvcc and clang.

AFAICT, clang will produce PTX .volatile for C++ volatile variables. That's the closest we can get to implement C++ semantics on GPU. The fact that it may end up providing stronger guarantees should not be relied on, IMO. It will not be obvious to whoever is going to read the code later. And LLVM back-end is not guaranteed to always lower C++ volatile as PTX .volatile.

Yes, I checked this and relied on the fact that clang generates .volatile kinds of ops.
I know, it is not guaranteed. Anyway, we could use conditional compilation in case of any changes in future versions. But PTX asm instructions are the best option here, I think.

In D62393#1542042, @tra wrote:

C++ volatile will give you that. You also rely on atomicity. C++ volatile does not guarantee that, even if NVPTX does happen to. It's a mere coincidence. What if next PTX revision provides a better way to implement C++ volatile without providing atomicity? Then your code will all of a sudden break in new and exciting ways.

I think it's completely healthy and fair to be skeptical that we'll do the right thing by you, but FWIW the changes I want to make to CUDA volatile after we ship std::atomic<T> go in the other direction. And the legacy requirements are pretty entrenched here, too.

No, I'm not relying on the non-optimization of atomics. I need both volatile semantics and atomicity, so the compiler could not optimize out memory accesses and the accesses would be atomic.

I'm going to try to improve the precision on that statement, if you'll allow me.

Since the address at &parallelLevel is necessarily backed by real memory (I mean RAM, and not PCI registers) because it is an instance introduced by C++ code, then many optimizations are not observable regardless of whatever decorations are applied to the declaration. For example x = *ptr; y = *ptr; can always be simplified to x = *ptr; y = x; when you know that ptr points to memory (and some other conditions, like the set of all y = ... simplified is finite). There is a long list of such simplifications based on "run in finite isolation" substitutions (*ptr = x; y = *ptr; => *ptr = x; y = x;... etc...).

To be clear: there is nothing you can do to prevent ptxas from performing optimizations such as these, even on accesses marked volatile, so the value of the predicate "could not optimize memory accesses" is just hardcoded to false, full stop.

The specific optimizations you don't want, are the optimizations that aren't valid on std::atomic<T> with memory_order_relaxed, for example performing substitutions across fences, barriers, and in potentially infinite loops. And again, the final conclusion is completely valid, you disable those optimizations with *.relaxed.sys or *.volatile in PTX. You just don't disable all optimizations.

Yes, this is what I need and what I'm trying to explain. Thanks, Olivier, you did it better than me.

In the end the mental model is simple: think as though you had std::atomic<T> available to you, the conclusions you draw and the code you write with that will be correct.

This is a clarification for some older comments.

First of all, you started this when you expressed the thought that the patches are accepted not because they are correct, but because we all work at the same company. This was rude! Besides, we don't work at the same company anymore.

I do not remember saying this, especially since I know you don't work at the same company.

Second, you need to try to understand the main idea yourself. I can't explain the whole protocol between the compiler and the runtime; it would take a lot of time. Try to understand this first, and then ask questions about particular details, not the whole scheme. Or do you want me to explain the principles of the Attributor in full? Or the loop vectorizer? Or something else? Your question is not productive.

That is exactly what happens (wrt. the Attributor): people looked at the code and asked about principles, requested documentation, etc. If you look at the code now, it is all the better for it. So far you have just ignored my request for clarifications and justifications, which is what this whole code review actually started with.

Third, the single array is used to handle up to 128 (not 256) inner parallel regions. It is fine from the point of view of the standard. The number of supported inner parallel levels (both active and inactive) is implementation-defined.

I asked where we document that a single array encodes both the number of active and inactive parallel regions at the same time. The code is not sufficient documentation for such a low-level implementation detail. Similar, but not on my radar before, is the fact that there is an apparently undocumented implementation detail wrt the number of levels we can handle.

Hal, I agree that it was a very dangerous situation, but I think you need to prove something before trying to escalate the situation and blame somebody for doing incorrect things. Johannes did this without any proof; he directly blamed me and others for doing improper things, though he has no idea at all how things work in Cuda.

I did already provide plenty of arguments in https://reviews.llvm.org/D62199#1515182 and https://reviews.llvm.org/D62199#1517073.

For the record, this code review was (from my side) never about accusations or improper behavior but about:

  1. The right solution to a problem I don't think we specified(*) properly, and
  2. The general lack of documentation that led to various questions on how to interpret the changes.

(*) My question about which accesses are racy has been ignored, as have my inquiries about an alternative explicit synchronization to communicate the value to all threads in the warp.

I don't think I have "no idea at all how things work in Cuda" and I don't appreciate the accusation.

Yes, but you need to have at least some basic knowledge about OpenMP itself, Cuda, and the OpenMP runtime. I have no time to write lectures explaining basic things to Johannes.

Me, and probably other people, might argue I have some OpenMP/Clang experience.

Yes, you're right. The design decisions must be described somewhere. But I don't think that this should prevent work on improving the runtime.

This seems to me like a core issue. Improvements without documentation will inevitably cause problems down the road, and rushing into a solution is not a sustainable practice either.

Arguably, atomic accesses provide atomicity. You also never responded to the question of which other accesses actually race with the ones to parallelLevel, or why you do not synchronize the accesses to parallelLevel across the warp explicitly.

I answered all the questions. We can use explicit synchronization, yes, but it will slow down the whole program execution. Instead, it is more effective to synchronize the access to the single variable. The volatile modifier allows us to do this without explicit atomic operations. In Cuda, volatile is slightly different from that in C/C++ and can be used to implement relaxed atomic operations.
In this array, the memory is written by one thread but can be read by other threads in the same warp. Without the volatile modifier or atomic ops, the ptxas tool reorders operations, and this leads to undefined behaviour when the runtime is inlined.

According to D50391, explicit relaxed accesses are lowered into volatile PTX accesses. Why did you see a slowdown or did you simply expect a slowdown?
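For reference, C11 can express such relaxed atomic accesses explicitly instead of relying on volatile. The following is a minimal host-side sketch; the names `parallelLevel`, `enter_parallel`, etc. are illustrative stand-ins, not the actual runtime API:

```c
#include <stdatomic.h>

/* Illustrative stand-in for the runtime's per-warp level counter.
 * A relaxed atomic guarantees that every access compiles to a real
 * load or store (no caching in registers), which is the property the
 * discussion above obtains from the CUDA volatile qualifier. */
static _Atomic unsigned char parallelLevel;

void enter_parallel(void) {
  atomic_fetch_add_explicit(&parallelLevel, 1, memory_order_relaxed);
}

void exit_parallel(void) {
  atomic_fetch_sub_explicit(&parallelLevel, 1, memory_order_relaxed);
}

unsigned char read_level(void) {
  return atomic_load_explicit(&parallelLevel, memory_order_relaxed);
}
```

On NVPTX, relaxed atomic loads and stores of this kind should lower to volatile PTX accesses per D50391, which is what makes the performance question above interesting.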

If somebody tries to review the code, the review must be productive. If the review leads to a set of lectures about language details, it is not productive; those are school lectures. Should I explain the Cuda language in the review? No. To review the code you need at least some minimal level of expertise.

I never asked for a school lecture, nor is it appropriate that you think you have to give me one instead of actually answering my questions and commenting on my concerns. Simply questioning my expertise and thereby ignoring my concerns is not helping anyone.

Hal, @hfinkel, do you really want me to continue with all this stuff?

Yes, there are outstanding technical questions; from @jdoerfert :

I'm also still not following what race condition you try to prevent with volatile in the first place, could you elaborate which accesses are racing without it?

and a related question regarding alternate synchronization schemes (my suspicion is that it's more useful to answer the question about which accesses are racing first, so that the reviewers can assess what ordering semantics are required, and then address and benefits or drawbacks of alternate synchronization schemes).

@jdoerfert also asked:

I asked where we document that a single array encodes both, the number of active and inactive parallel regions, at the same time. The code is not sufficient documentation for such a low-level implementation detail. Similar, but before not on my radar, is the fact that there is an apparently undocumented implementation detail wrt the number of levels we can handle.

And either the requested documentation should be added, or if it already exists, you should point that out.

Your reply to @Hahnfeld:

... 1 is returned if at least 1 parallel region is active. And we count both, active and inactive regions. We may have only 1 active region, the very first one. If the first region is active, the MSB is set to 1 to mark this situation.

this should also appear in a comment in the code.
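Such a comment could be accompanied by a small sketch of the encoding. The following is illustrative C under the assumptions quoted above (a byte-sized entry, the MSB marking the single possible active region, the low bits counting nesting up to the 128-level limit mentioned earlier); the real runtime's field names and widths may differ:

```c
/* Hypothetical encoding of one parallelLevel entry:
 *   bits 0-6: number of parallel regions entered so far, active
 *             or inactive (implementation-defined limit of 128);
 *   bit 7   : set iff the first region - the only one that can
 *             be active - is active. */
enum { ACTIVE_MSB = 0x80, LEVEL_MASK = 0x7f };

unsigned char push_region(unsigned char lvl, int first_is_active) {
  unsigned char count = (unsigned char)((lvl & LEVEL_MASK) + 1);
  unsigned char msb = lvl & ACTIVE_MSB;
  /* Only entering the very first region can mark it active. */
  if (first_is_active && (lvl & LEVEL_MASK) == 0)
    msb = ACTIVE_MSB;
  return (unsigned char)(msb | count);
}

int region_count(unsigned char lvl) { return lvl & LEVEL_MASK; }
int in_active_region(unsigned char lvl) { return (lvl & ACTIVE_MSB) != 0; }
```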

In general, please treat all reviewer comments as comments from experts that should be addressed (and, generally, code reviewers do have relatively-rare expertise in particular areas). Experts are also busy, however, so we all try to be respectful of each other's time. Experts also make mistakes and suboptimal decisions, of course, which is why we have code review. Thus, if someone asks for an explanation of how something works in the current or updated code, then provide an explanation. If someone requests additional documentation, then add it. Yes, the expert can read the code, but it's more efficient for you to provide the explanation (as the author of the patch, you have presumably read the code most recently) or documentation.

Again, I see no reason to believe that everyone here isn't acting in good faith and working to create the software of the highest possible quality. Thanks!

tra added a comment.Jun 13 2019, 10:42 AM
In D62393#1542042, @tra wrote:

C++ volatile will give you that. You also rely on atomicity. C++ volatile does not guarantee that, even if NVPTX does happen to. It's a mere coincidence. What if next PTX revision provides a better way to implement C++ volatile without providing atomicity? Then your code will all of a sudden break in new and exciting ways.

I think it's completely healthy and fair to be skeptical that we'll do the right thing by you, but FWIW the changes I want to make to CUDA volatile after we ship std::atomic<T> go in the other direction. And the legacy requirements are pretty entrenched here, too.

I didn't mean this to be a critical statement. I apologize if it came out as such. The example was purely hypothetical to illustrate the point that one should not assume that a C++ qualifier will always do what C++ standard does not guarantee.

This is a clarification for some older comments.

First of all, you started this when you expressed the thought that the patches were accepted not because they are correct, but because we all work at the same company. That was rude! Besides, we don't work at the same company anymore.

I do not remember saying this, especially since I know you don't work at the same company.

Hal, @hfinkel, do you really want me to continue with all this stuff?

Yes, there are outstanding technical questions; from @jdoerfert :

I'm also still not following what race condition you try to prevent with volatile in the first place, could you elaborate which accesses are racing without it?

Several threads in the same warp access the same parallelLevel array element. The compiler optimizes out some of the accesses (because it is not aware of the accesses from several threads), and this leads to undefined behaviour.
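As a plain-C illustration of that failure mode (hypothetical names; the real code lives in libomptarget's NVPTX runtime): one thread publishes a value that other threads in the warp poll, and without volatile the compiler may legally keep the polled location in a register.

```c
/* One warp master writes the flag; the other threads poll it.
 * 'volatile' forces a real memory load on every loop iteration;
 * without it, the compiler could hoist the load out of the loop
 * and spin forever on a stale register copy. */
static volatile unsigned char parallelLevel[32]; /* one slot per warp */

void publish(int warp_id) {
  parallelLevel[warp_id] = 1; /* real store, cannot be cached away */
}

int wait_for_publish(int warp_id) {
  while (parallelLevel[warp_id] == 0) {
    /* spin: each test re-reads memory because of 'volatile' */
  }
  return parallelLevel[warp_id];
}
```

The sketch is single-threaded and host-side for illustration only; on the device the writer and readers are divergent threads of one warp, which is what makes the reordering by ptxas observable.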

and a related question regarding alternate synchronization schemes (my suspicion is that it's more useful to answer the question about which accesses are racing first, so that the reviewers can assess what ordering semantics are required, and then address and benefits or drawbacks of alternate synchronization schemes).

@jdoerfert also asked:

I asked where we document that a single array encodes both, the number of active and inactive parallel regions, at the same time. The code is not sufficient documentation for such a low-level implementation detail. Similar, but before not on my radar, is the fact that there is an apparently undocumented implementation detail wrt the number of levels we can handle.

And either the requested documentation should be added, or if it already exists, you should point that out.

I agree that documentation should be added; I already said this. But it is not directly related to this patch.

Your reply to @Hahnfeld:

... 1 is returned if at least 1 parallel region is active. And we count both, active and inactive regions. We may have only 1 active region, the very first one. If the first region is active, the MSB is set to 1 to mark this situation.

this should also appear in a comment in the code.

In general, please treat all reviewer comments as comments from experts that should be addressed (and, generally, code reviewers do have relatively-rare expertise in particular areas). Experts are also busy, however, so we all try to be respectful of each other's time. Experts also make mistakes and suboptimal decisions, of course, which is why we have code review. Thus, if someone asks for an explanation of how something works in the current or updated code, then provide an explanation. If someone requests additional documentation, then add it. Yes, the expert can read the code, but it's more efficient for you to provide the explanation (as the author of the patch, you have presumably read the code most recently) or documentation.

Yes, I agree, and I can answer the technical questions if they are productive and lead to the best code quality.

Again, I see no reason to believe that everyone here isn't acting in good faith and working to create the software of the highest possible quality. Thanks!

Sorry, but Johannes did this. "The more patches go through with this kind of "review" the more "suspicious" this looks to me." Those are his words. It seems to me he does not think that my colleagues and I are acting in good faith.

reames removed a subscriber: reames.Jun 13 2019, 11:33 AM

...

Again, I see no reason to believe that everyone here isn't acting in good faith and working to create the software of the highest possible quality. Thanks!

Sorry, but Johannes did this. "The more patches go through with this kind of "review" the more "suspicious" this looks to me." Those are his words. It seems to me he does not think that my colleagues and I are acting in good faith.

Okay, I understand why you say this, but no. Frankly, Johannes was implying that the reviews were too lax (e.g., allowing combined patches with insufficient tests/documentation/etc.). I think that all of these issues have now been covered here (and on D62199, etc.), and so I won't repeat them again. The reviewers (e.g., @grokos for the case of D62199) almost certainly believed that they were being helpful and moving the development process forward. The fact that the patches did not all follow our best practices, etc. makes Johannes "suspicious" that the patches could have reasonably been improved before being committed. It makes me suspicious of that too. But no one here believes that anyone was trying to subvert the system and produce an inferior result - it is very likely that everyone had, and continues to have, the best of intentions. And, I'll point out that we have indeed accomplished a lot in terms of OpenMP feature enablement, etc. At this point, we have more eyes on the process and we should all work together to produce an even-better result. The glass, in this case, really is half full.

I want to investigate the racy accesses further and make sure it is not a miscompile inside LLVM.
I extracted the test case (see below) but I was not seeing the ERROR. How did you run the test case to see a different value for Count?

@ye-luo ran it with Clang-ykt (2019-05-27) on a Quadro P1000 and I ran Clang (Jun 13 15:24:11 2019 : 3bc6e2a7aa3853b06045c42e81af094647c48676) on a V100 so far.

#include <stdio.h>

int main() {

  for (int i = 0; i < 1000; ++i) {
    int Count = 0;
#pragma omp target parallel for reduction(+: Count) schedule(dynamic, 2) num_threads(64)
    for (int J = 0; J < 1000; ++J) {
      Count += J;
    }
    if (Count != 499500)
      printf("ERROR [@%i] %i\n", i, Count);
  }

  // Final result of Count is 1000 * (999-0) / 2
  // CHECK: Expected count with dynamic scheduling = 499500
  // printf("Expected count with dynamic scheduling = %d\n", Count);
}

...

Again, I see no reason to believe that everyone here isn't acting in good faith and working to create the software of the highest possible quality. Thanks!

Sorry, but Johannes did this. "The more patches go through with this kind of "review" the more "suspicious" this looks to me." Those are his words. It seems to me he does not think that my colleagues and I are acting in good faith.

Okay, I understand why you say this, but no. Frankly, Johannes was implying that the reviews were too lax (e.g., allowing combined patches with insufficient tests/documentation/etc.).

Only one patch was combined. Tests were added to all the functional patches as soon as I discovered that the testing infrastructure for libomptarget was already configured. And I myself insist on adding test cases for functional patches; you can ask Doru about this.
As for the documentation: yes, we lack documentation for libomptarget. This is a long story and it requires dedicated work. I'm also not very happy with the current situation. But the documentation is a separate piece of work. It requires a special set of patches with the design decisions etc. Most of it follows the design of the original libomp, but has some NVPTX-specific parts.

I think that all of these issues have now been covered here (and on D62199, etc.), and so I won't repeat them again. The reviewers (e.g., @grokos for the case of D62199) almost certainly believed that they were being helpful and moving the development process forward.

The fact that the patches did not all follow our best practices, etc. makes Johannes "suspicious" that the patches could have reasonably been improved before being committed.

Which ones do not follow "best practices"? The combined one? That is only one patch. The NFC patches? They are refactoring patches, and we already have tests for them. He did not try to understand the idea of the patches; instead he decided to treat them as "suspicious" without any comment.

It makes me suspicious of that too. But no one here believes that anyone was trying to subvert the system and produce an inferior result - it is very likely that everyone had, and continues to have, the best of intentions.

Maybe, just maybe, before starting to treat someone's activity as "suspicious", it would be better to try to understand something first? To read something, to ask questions in a proper manner, etc.?

And, I'll point out that we have indeed accomplished a lot in terms of OpenMP feature enablement, etc. At this point, we have more eyes on the process and we should all work together to produce an even-better result. The glass, in this case, really is half full.

I want to investigate the racy accesses further and make sure it is not a miscompile inside LLVM.

This is not a problem inside LLVM. The problem appears after optimizations performed by the ptxas tool (when it compiles PTX to SASS) at O3 with the inlined runtime.

I extracted the test case (see below) but I was not seeing the ERROR. How did you run the test case to see a different value for Count?

You need to compile it with the inlined runtime at O2 or O3. And you need the latest version of libomptarget; the clang-ykt version is implemented differently and does not have this problem.

It makes me suspicious of that too. But no one here believes that anyone was trying to subvert the system and produce an inferior result - it is very likely that everyone had, and continues to have, the best of intentions.

Maybe, just maybe, before starting to treat someone's activity as "suspicious", it would be better to try to understand something first? To read something, to ask questions in a proper manner, etc.?

I did ask questions [0,1] and read *a lot* [2,3].

Here, and in D62199, my comments resulted in:

  • Complex technical discussions with commitments towards improving the patches and review process ([4] to name one), and
  • Various comments and accusations that people found inappropriate ([1,5-11] to restrict it to these patches only). The one you refer to is in [1], and in my follow-up [12] I agreed that my words were not well chosen.

[0] https://reviews.llvm.org/D62199#1512617
[1] https://reviews.llvm.org/D62393#1528421
[2] https://reviews.llvm.org/D62199#1515182
[3] https://reviews.llvm.org/D62199#1517073

[4] https://reviews.llvm.org/D62199#1515072

[5] https://reviews.llvm.org/D62199#1512638
[6] https://reviews.llvm.org/D62199#1513126
[7] https://reviews.llvm.org/D62199#1514027
[8] https://reviews.llvm.org/D62393#1533461
[9] https://reviews.llvm.org/D62393#1539086
[10] https://reviews.llvm.org/D62393#1539630
[11] https://reviews.llvm.org/D62393#1542505

[12] https://reviews.llvm.org/D62199#1515182

[0] was enough for me to treat all your subsequent actions in a certain way. Your questions do not mean anything; you had already made your decision, and all the questions were just there to confirm that first judgement.

...

Again, I see no reason to believe that everyone here isn't acting in good faith and working to create the software of the highest possible quality. Thanks!

Sorry, but Johannes did this. "The more patches go through with this kind of "review" the more "suspicious" this looks to me." Those are his words. It seems to me he does not think that my colleagues and I are acting in good faith.

Okay, I understand why you say this, but no. Frankly, Johannes was implying that the reviews were too lax (e.g., allowing combined patches with insufficient tests/documentation/etc.).

Only one patch was combined.

I'm not sure that this is true. D55379 was combined as well. There might be others. I don't think that it's useful to conduct an audit in this regard. The important point is that there were a number of issues over time; some have been addressed, and others we're addressing right now.

Tests were added to all the functional patches as soon as I discovered that the testing infrastructure for libomptarget was already configured. And I myself insist on adding test cases for functional patches; you can ask Doru about this.

I'm aware of a lot of the history (and I recall being involved in the discussion in D60578 regarding the tests). Nevertheless, we've somehow ended up with a library without sufficient documentation, and robust code reviews should have prevented that. It's not useful to assign blame, but it is something of which we should all be aware so that we can actively improve the situation going forward.

As for the documentation: yes, we lack documentation for libomptarget. This is a long story and it requires dedicated work. I'm also not very happy with the current situation. But the documentation is a separate piece of work. It requires a special set of patches with the design decisions etc. Most of it follows the design of the original libomp, but has some NVPTX-specific parts.

Yes, we should have documentation covering the design decisions and other details, but I don't believe that treating this as a special task is productive. When a subsystem lacks good documentation, we should improve it as we make changes to the code - the fact that the documentation is poor is not a good reason for omitting comments in the code now. It is, moreover, an even stronger argument for adding comments in the code now. Please do add whatever documentation is possible and associated with this change now.

I think that all of these issues have now been covered here (and on D62199, etc.), and so I won't repeat them again. The reviewers (e.g., @grokos for the case of D62199) almost certainly believed that they were being helpful and moving the development process forward.

The fact that the patches did not all follow our best practices, etc. makes Johannes "suspicious" that the patches could have reasonably been improved before being committed.

Which ones do not follow "best practices"? The combined one? That is only one patch. The NFC patches? They are refactoring patches, and we already have tests for them.

Again, it is not just one patch, and there have been a variety of issues. Some have already been addressed and others we're addressing here. The important point is to make sure that things improve over time, and that's exactly what we can do now.

He did not try to understand the idea of the patches; instead he decided to treat them as "suspicious" without any comment.

This statement is unproductive, and likely false. Johannes has spent a lot of time looking at the past patches and the code. If he sees a systematic problem, then there probably is one.

It makes me suspicious of that too. But no one here believes that anyone was trying to subvert the system and produce an inferior result - it is very likely that everyone had, and continues to have, the best of intentions.

Maybe, just maybe, before starting to treat someone's activity as "suspicious", it would be better to try to understand something first? To read something, to ask questions in a proper manner, etc.?

Again, this is unproductive, and false. I agree that Johannes could have been more polite at times - frustration can be stated directly, and that's almost always more productive - but let's be fair: there's a significant corpus of text where you're rude, dismissive, abrasive, and so on. I can provide links if you'd like, but I think you know very well that it's true. That should stop. Nevertheless, you expect everyone else to assume that you're working in good faith, have fairly considered and addressed their feedback, and so on. I suggest that you provide Johannes, in this case, the same courtesy you expect others to provide to you.

And, I'll point out that we have indeed accomplished a lot in terms of OpenMP feature enablement, etc. At this point, we have more eyes on the process and we should all work together to produce an even-better result. The glass, in this case, really is half full.

I want to investigate the racy accesses further and make sure it is not a miscompile inside LLVM.

This is not a problem inside LLVM. The problem appears after optimizations performed by the ptxas tool (when it compiles PTX to SASS) at O3 with the inlined runtime.

I extracted the test case (see below) but I was not seeing the ERROR. How did you run the test case to see a different value for Count?

You need to compile it with the inlined runtime at O2 or O3.

When I run
./bin/clang -fopenmp-targets=nvptx64-nvidia-cuda -O3 -fopenmp --cuda-path=/soft/compilers/cuda/cuda-9.1.85 -Xopenmp-target -march=sm_70 -fopenmp=libomp test.c -o test.ll -emit-llvm -S
I get

https://gist.github.com/jdoerfert/4376a251d98171326d625f2fb67b5259

which shows the inlined and optimized libomptarget.

And you need the latest version of the libomptarget

My version is from today Jun 13 15:24:11 2019, git: 3bc6e2a7aa3853b06045c42e81af094647c48676

We have problems with Cuda 8, at least for arch sm_35.

...

Again, I see no reason to believe that everyone here isn't acting in good faith and working to create the software of the highest possible quality. Thanks!

Sorry, but Johannes did this. "The more patches go through with this kind of "review" the more "suspicious" this looks to me." Those are his words. It seems to me he does not think that my colleagues and I are acting in good faith.

Okay, I understand why you say this, but no. Frankly, Johannes was implying that the reviews were too lax (e.g., allowing combined patches with insufficient tests/documentation/etc.).

Only one patch was combined.

I'm not sure that this is true. D55379 was combined as well. There might be others. I don't think that it's useful to conduct an audit in this regard. The important point is that there were a number of issues over time; some have been addressed, and others we're addressing right now.

Ok, you found another one, committed half a year ago. Do you really think you can call that systematic and suspicious?

Tests were added to all the functional patches as soon as I discovered that the testing infrastructure for libomptarget was already configured. And I myself insist on adding test cases for functional patches; you can ask Doru about this.

I'm aware of a lot of the the history (and I recall being involved in the discussion in D60578 regarding the tests). Nevertheless, we've somehow ended up with a library without sufficient documentation, and robust code reviews should have prevented that. It's not useful to assign blame, but it is something of which we should all be aware so that we can actively improve the situation doing forward.

I already said that we need documentation. But I can't do everything myself. Plus, I'm not the original developer of the library; I'm just doing my best to improve it. Maybe you or Johannes could help with this and prepare the documentation? That would be a great help!

As for the documentation, yes, we lack of the documentation for libomptarget. This is a long story and it requires special work. I'm also not very happy with the current situation. But the documentation is a separate work. It requires special set of patches with the design decisions etc. Most of it follows the design of the original libomp, bit has some NVPTX specific parts.

Yes, we should have documentation covering the design decisions and other details, but I don't believe that treating this as a special task is productive. When a subsystem lacks good documentation, then we should improve that as we make changes to the code - just because the documentation is poor is not a good reason for not adding comments in the code now. It is, moreover, an even stronger argument for adding comments in the code now. Please do add whatever documentation is possible and associated with this change now.

We need the base where we could start. It is a separate work to prepare a base document which could be modified/upgraded with the next patches.

I think that all of these issues have now been covered here (and on D62199, etc.), and so I won't repeat them again. The reviewers (e.g., @grokos for the case of D62199) almost certainly believed that they were being helpful and moving the development process forward.

The fact that the patches did not all follow our best practices, etc. makes Johannes "suspicious" that the patches could have reasonably been improved before being committed.

Which one do not follow "best practices"? The combined one? This is only one patch. NFC patches? They are refactoring patches and we already have the tests for them.

Again, it is not just one patch, and there have been a variety of issues. Some have already been addressed and others we're addressing here. The important point is to make sure that things improve over time, and that's exactly what we can do now.

He did not try to understand the idea of the patches, instead he decided to treat them "suspicious" without any comments.

This statement is unproductive, and likely false. Johannes has spent a lot of time looking at the past patches and the code. If he sees a systematic problem, then there probably is one.

Let me disagree with you.

It makes me suspicious of that too. But no one here believes that anyone was trying to subvert the system and produce an inferior result - it is very likely that everyone had, and continues to have, the best of intentions.

Maybe, just maybe, before starting treat someone's activity "suspicious" better to start to try to understand something? To read something, to ask the questions in the proper manner, etc.?

Again, this is unproductive, and false. I agree with Johannes could have been more polite at times - frustration can be stated directly, and that's almost always more productive - but let's be fair: there's a significant corpus of text where you're rude, dismissive, abrasive, and so on. I can provide links if you'd like, but I think you very well know that it's true. That should stop. Nevertheless, you expect everyone else to assume that you're working in good faith, have fairly considered and addressed their feedback, and so on. I suggest that you provide Johannes in this case this came courtesy you expect others to provide to you.

Sure, I was rude. When somebody blames me without any proves, I can be rude. I'm sorry for this! But it was caused by the unconstructive initial comment.

And, I'll point out, that we have indeed accomplished a lot in terms of OpenMP feature enablement, etc. At this point, we have more eyes on the process and we'll should all work together to produce an even-better result. The glass, in this case, really is half full.

hfinkel added a comment (edited). Jun 13 2019, 6:06 PM

...

Again, I see no reason to believe that everyone here isn't acting in good faith and working to create the software of the highest possible quality. Thanks!

Sorry, but Johannes did this. "The more patches go through with this kind of "review" the more "suspicious" this looks to me." Those are his words. It seems to me that he does not think that I or my colleagues are acting in good faith.

Okay, I understand why you say this, but no. Frankly, Johannes was implying that the reviews were too lax (e.g., allowing combined patches with insufficient tests/documentation/etc.).

Only one patch was combined.

I'm not sure that this is true. D55379 was also. There might be others. I don't think that it's useful to conduct an audit in this regard. The important point is that there were a number of issues over time, some have been addressed, and others we're addressing right now.

Ok, you found another one committed half a year ago. Do you really think you can call it systematic and suspicious?

If there is a systematic problem, then this would be one part among several. Regardless, we both agree that this is something to be avoided, so as far as I'm concerned, this problem has been addressed.

Tests have been added to all the functional patches since I discovered that the testing infrastructure for libomptarget is already configured. And I myself insist on adding test cases for the functional patches; you can ask Doru about this.

I'm aware of a lot of the history (and I recall being involved in the discussion in D60578 regarding the tests). Nevertheless, we've somehow ended up with a library without sufficient documentation, and robust code reviews should have prevented that. It's not useful to assign blame, but it is something of which we should all be aware so that we can actively improve the situation going forward.

I already said that we need documentation. But I can't do everything myself. Plus, I'm not the original developer of the library; I'm just doing my best trying to improve it. Maybe you or Johannes can try to help with this and prepare the documentation? That would be a great help!

Johannes and I can help with the documentation.

As for the documentation, yes, we lack documentation for libomptarget. This is a long story and it requires special work. I'm also not very happy with the current situation. But the documentation is a separate piece of work. It requires a special set of patches with the design decisions, etc. Most of it follows the design of the original libomp but has some NVPTX-specific parts.

Yes, we should have documentation covering the design decisions and other details, but I don't believe that treating this as a special task is productive. When a subsystem lacks good documentation, then we should improve that as we make changes to the code - just because the documentation is poor is not a good reason for not adding comments in the code now. It is, moreover, an even stronger argument for adding comments in the code now. Please do add whatever documentation is possible and associated with this change now.

We need a base from which to start. It is separate work to prepare a base document which can be modified/upgraded in subsequent patches.

The lack of existing overall documentation does not prevent adding a comment explaining this particular variable, what it stores and why it is marked as volatile, now.

I think that all of these issues have now been covered here (and on D62199, etc.), and so I won't repeat them again. The reviewers (e.g., @grokos for the case of D62199) almost certainly believed that they were being helpful and moving the development process forward.

The fact that the patches did not all follow our best practices, etc. makes Johannes "suspicious" that the patches could have reasonably been improved before being committed.

Which ones do not follow "best practices"? The combined one? That is only one patch. The NFC patches? They are refactoring patches, and we already have the tests for them.

Again, it is not just one patch, and there have been a variety of issues. Some have already been addressed and others we're addressing here. The important point is to make sure that things improve over time, and that's exactly what we can do now.

He did not try to understand the idea of the patches, instead he decided to treat them "suspicious" without any comments.

This statement is unproductive, and likely false. Johannes has spent a lot of time looking at the past patches and the code. If he sees a systematic problem, then there probably is one.

Let me disagree with you.

It makes me suspicious of that too. But no one here believes that anyone was trying to subvert the system and produce an inferior result - it is very likely that everyone had, and continues to have, the best of intentions.

Maybe, just maybe, before starting to treat someone's activity as "suspicious", it would be better to try to understand it first? To read something, to ask questions in the proper manner, etc.?

Again, this is unproductive, and false. I agree that Johannes could have been more polite at times - frustration can be stated directly, and that's almost always more productive - but let's be fair: there's a significant corpus of text where you're rude, dismissive, abrasive, and so on. I can provide links if you'd like, but I think you know very well that it's true. That should stop. Nevertheless, you expect everyone else to assume that you're working in good faith, have fairly considered and addressed their feedback, and so on. I suggest that you provide Johannes, in this case, the same courtesy you expect others to provide to you.

Sure, I was rude. When somebody blames me without any proof, I can be rude. I'm sorry for this! But it was caused by the unconstructive initial comment.

At this point, I feel as though everyone here has voiced their concerns about past practices and behavior. We understand how to address each other's concerns going forward and maintain professional discourse. We should do so. I highly recommend that, from this point on, we avoid comments that a reader is likely to view as rude, condescending, dismissive, or confrontational. Any suboptimal code, documentation, etc. in previous patches can be improved by future patches, and that should be our focus going forward.

And, I'll point out, we have indeed accomplished a lot in terms of OpenMP feature enablement. At this point, we have more eyes on the process, and we should all work together to produce an even better result. The glass, in this case, really is half full.

Johannes and I can help with the documentation.

Good!

The lack of existing overall documentation does not prevent adding a comment explaining this particular variable, what it stores and why it is marked as volatile, now.

Sure, I'll add a comment for the variable describing its purpose and why it must be marked volatile.

I want to investigate the racy accesses further and make sure it is not a miscompile inside LLVM.

This is not a problem inside LLVM. The problem appears after optimizations performed by the ptxas tool (when it compiles PTX to SASS) at O3 with the inlined runtime.

I extracted the test case (see below) but I was not seeing the ERROR. How did you run the test case to see a different value for Count?

You need to compile it with the inlined runtime at O2 or O3.

When I run
./bin/clang -fopenmp-targets=nvptx64-nvidia-cuda -O3 -fopenmp --cuda-path=/soft/compilers/cuda/cuda-9.1.85 -Xopenmp-target -march=sm_70 -fopenmp=libomp test.c -o test.ll -emit-llvm -S
I get

https://gist.github.com/jdoerfert/4376a251d98171326d625f2fb67b5259

which shows the inlined and optimized libomptarget.

And you need the latest version of the libomptarget

My version is from today Jun 13 15:24:11 2019, git: 3bc6e2a7aa3853b06045c42e81af094647c48676

We have problems in Cuda 8, at least, for arch sm_35

I couldn't get that version to run properly so I asked someone who had a system set up.
Unfortunately, the test.c [1] did not trigger the problem. In test.c we run the new test part in spmd_parallel_regions.cpp 1000 times and check the result each time.
It was run with Cuda 8.0 for sm_35, sm_37, and sm_70.

Could you share more information on how the system has to look to trigger the problem?
Could you take a look at the test case we run and make sure it triggers the problem on your end?

[1] https://gist.github.com/jdoerfert/d2b18ca8bb5c3443cc1d26b23236866f

Will provide additional info on Tuesday. Most probably, this simplified test does not trigger the problem in your configuration. Will send the original, more complex test that triggers the problem.

You need to apply the patch D62318 to reproduce the problem for sure.

This means the problem, as of right now, does not exist, correct?

No, it still might appear, but it is much harder to run into it with the current version of the runtime.

If so, what part of the D62318 patch is causing the problem?

I significantly reduced the size of the runtime class, and that triggers some optimizations more often. The access to the parallelLevel variable, when we check the current parallel level to get the correct thread ID, triggers those optimizations.

Does the test.c that I floated earlier expose the problem then or do I need a different test case?
What configuration are you running? Is it reproducible with Cuda 9/10 and sm_70?

Yes, it exposes the problem, but only with D62318 applied. Not sure about Cuda9, will try to check this later today.

Checked with Cuda9: it works. Most probably, the problem is specific to Cuda8; most probably, some optimizations in the ptxas tool were fixed in Cuda9. Clang defines the macro CUDA_VERSION and sets it to 8000 for Cuda8, so I can check this macro and use volatile modifiers only for Cuda8.
To reproduce the problem, you need to build the runtime without debug info and build the test at O3.

ABataev updated this revision to Diff 206671.Jun 26 2019, 7:54 AM

Use inline assembler to implement volatile load/stores for parallelLevel.

tra added inline comments.Jun 26 2019, 8:50 AM
libomptarget/deviceRTLs/nvptx/src/supporti.h
95–99 ↗(On Diff #206671)

Perhaps it can be deduped into something like this:

#ifdef __LP64__
#define PTR_CONSTRAINT               "l"
#else  // __LP64__
#define PTR_CONSTRAINT               "r"
#endif // __LP64__

... 
asm volatile("st.volatile.u8 [%1], %0;" :: "r"(Val), PTR_CONSTRAINT(&parLevel)
             : "memory");
ABataev marked an inline comment as done.Jun 26 2019, 8:54 AM
ABataev added inline comments.
libomptarget/deviceRTLs/nvptx/src/supporti.h
95–99 ↗(On Diff #206671)

Sure, thanks.

ABataev updated this revision to Diff 206693.Jun 26 2019, 9:01 AM

Simplified constraint.

Thanks for the explicit asm. One of the hazards of compiling cuda for amdgcn is that volatile doesn't imply atomic, so this is a clear warning that I'll have a bug on merge.

libomptarget/deviceRTLs/nvptx/src/supporti.h
121 ↗(On Diff #218501)

Could this function call getParallelLevel, increment/decrement the result, then call setParallelLevel?

ABataev updated this revision to Diff 219739.Sep 11 2019, 9:53 AM

Used getParallelLevel/setParallelLevel instead of explicit asm instructions in changeParallelLevel.

ABataev marked an inline comment as done.Sep 11 2019, 9:53 AM

It is probably not directly related to this patch, but I think the logic of omp_in_parallel() is not valid. It only works because the level counting is also broken. This code

#include <omp.h>
#include <stdio.h>

int main(){
#pragma omp target
  #pragma omp parallel if(0)
    if(omp_get_thread_num()==0) {
      printf("omp_in_parallel=%i\n", omp_in_parallel());
      printf("omp_get_level=%i\n", omp_get_level());
      printf("omp_get_active_level=%i\n", omp_get_active_level());
    }
  return 0;
}

should print (and does so with env OMP_TARGET_OFFLOAD=disabled):

omp_in_parallel=0
omp_get_level=1
omp_get_active_level=0

As I understand the code for omp_in_parallel(), the example would print omp_in_parallel=1 if the level were correctly increased.

You're wrong. My output:

omp_in_parallel=0
omp_get_level=1
omp_get_active_level=0

So, omp_in_parallel works correctly on GPU.

If it does not work correctly with OMP_TARGET_OFFLOAD=disabled, the problem is definitely in libomp, not libomptarget. With disabled offloading the code is not offloaded to GPU.

Ok, thanks for testing. I probably didn't apply the patch correctly, then.

I got above output only with host offloading, so that part also seems to be fine.

After revisiting the code I understood that the value returned by getParallelLevel is basically a bitfield: struct { int active_parallel_level : 1; int parallel_level : 7; }.

Yes, you got it!
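A hypothetical decoding of that packed byte can be sketched as follows. The bit positions and helper names here are my assumptions for illustration, not the actual deviceRTL layout:

```cpp
#include <cassert>

// One byte per warp: the top bit records whether some enclosing parallel
// region is active, and the low seven bits count nesting levels, including
// serialized regions such as "parallel if(0)". The layout is assumed.
constexpr unsigned char kActiveBit = 0x80;
constexpr unsigned char kLevelMask = 0x7f;

int ompGetLevel(unsigned char packed) { return packed & kLevelMask; }

int ompGetActiveLevel(unsigned char packed) {
  // At most one active level is representable in this simplified model.
  return (packed & kActiveBit) ? 1 : 0;
}

int ompInParallel(unsigned char packed) {
  // True only when an *active* parallel region encloses the caller.
  return (packed & kActiveBit) ? 1 : 0;
}
```

With this encoding, the serialized region in the earlier example corresponds to the byte 0x01: level 1 with the active bit clear, so omp_in_parallel() returns 0, matching the printed output.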

There was a lot of discussion here, largely about volatile vs atomic. The final patch looks uncontentious to me - accesses to parallelLevel are now wrapped in calls that implement atomic access in the appropriate fashion for nvptx. I agree that the test is at coarse granularity, but as it's essentially poking at the race condition I can't see a better way to do it.

I'm confident in this patch as it stands. @jdoerfert?

@grokos , Georgios, could you approve the patch if there are no more comments?

So this changed over time, we for sure need to update the name and maybe commit message.

As @JonChesterfield pointed out, the wrapping of the accesses in function calls is fine.
The two functions with inline assembly should live in the target_impl.h though.

At some point it was said this fixes a problem that only occurs with D62318 applied.
Is this still the case? If not, we need to specify exactly how we can reproduce this issue, e.g., in a comment next to the target offload region.

Yes, it is still the case. The cuda8 compiler is too optimistic and makes some incorrect optimizations with D62318 applied. This patch prevents these optimizations. Cuda9 and later have no such problems.

ABataev updated this revision to Diff 223602. Oct 7 2019, 7:38 AM

Moved target-specific code to target_impl.h

we for sure need to update the name and maybe commit message.

Assuming the name & commit message are adjusted I'm withdrawing my objections.

I was hoping to look into this issue because I'm *very* unhappy to claim "The cuda8 compiler is too optimistic and makes some incorrect optimizations" without actually understanding what the problem is. Unfortunately, we still do not have a proper CI infrastructure to test things and I seem to not find the time to reproduce the issue on my end. (anyone else interested in creating a reproducer for this? @Hahnfeld maybe?)

That seems fair. At the least, the comment could be more specific.

I thought the salient part of the patch was the "memory" clobber, working around a compiler bug that reordered memory accesses incorrectly. On a closer look, though, the comment suggests the fix is actually the .volatile qualifier on the instruction influencing ptxas. If so, a volatile qualifier on the byte array would presumably also be a fix. Why is the underlying byte array not volatile qualified?

I will fix the description and the title. The fix is strictly for cuda8; cuda9 and later do not have this problem.
As to the volatile modifier, read the whole thread. There was a big discussion about it.

I have read the thread. There's a long discussion about volatile vs atomic. There is no justification for marking the accesses volatile via asm instead of marking the byte array as volatile.

The semantics of ptx volatile are stronger than those of llvm. Weird, but not a justification for omitting the volatile qualifier from the C in this change.

Assuming llvm doesn't spuriously drop volatile (which would be a severe bug), and that the lowering to ptx doesn't drop volatile (which also seems likely to be fatal), a volatile array in C would lead to the same memory operations introduced by the asm. That's why I thought it was the memory clobber that mattered.

It means that you missed something. Initially, the patch marked the whole array as volatile, but after some discussion it was decided to use ptx instructions directly to avoid some side effects.

I don't think so. Phabricator doesn't make the discussion easy to follow relative to the changes in code, but I'm pretty sure the sequence was:

  • Mark the array volatile in the initial patch
  • Long discussion about volatile vs atomic
  • Agreement that the control flow requires atomic semantics
  • Discussion about how to express that in cuda
  • Suggestion to "declare the variable volatile, and then use inline asm ld.volatile/st.volatile to increment it"
  • Drop the volatile qualification on the array and use inline asm to increment it
  • Write a comment saying a volatile access is used to avoid dangerous optimizations
  • We start this exchange

The byte array should be volatile qualified, as it was in your initial patch. And you should use inline asm or equivalent to prevent memory reordering for ptx. And the comment should probably reflect that the code requires both ptx volatile and the constraint on memory order.

Not again! We don't need to make it atomic; we have memory and thread barriers. But cuda8 sometimes just misses them and moves accesses to the variables across the barriers though it should not. The simplest and least intrusive way to prevent this is to mark those memory accesses as volatile (or atomic, which is too expensive). And this is what we do in this patch.

JonChesterfield added a comment. Edited Oct 7 2019, 6:42 PM

I didn't say it should be atomic. I said the data should be volatile and the comment updated.

This patch prevents llvm from moving memory accesses across other memory operations by using a memory clobber in the inline assembly. That the asm also contains the word volatile is distracting but irrelevant to the transforms in llvm.

The PTX compiler presumably looks at the memory clobber or parses the inline asm - either would suffice there.

This discussion is complicated by two notions of volatile (llvm, ptx) and four compilers (clang, llvm, ptx, nvcc). Consequently it's ambiguous which barriers you're referring to but I don't think that matters here. I suspect 'should not' is contentious as it requires llvm to treat volatile differently when compiling for nvptx.

This patch doesn't mark memory accesses as volatile in llvm. It might do in ptx, if that compiler parses inline asm. This patch does introduce a memory clobber which prevents reordering operations.
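For reference, the "memory" clobber under discussion is the standard inline-asm compiler barrier. A minimal sketch follows (GCC/Clang syntax; the PTX ld.volatile/st.volatile body that the actual patch emits is omitted so that this compiles on any host):

```cpp
#include <cassert>

// An empty asm statement with a "memory" clobber: the compiler must assume
// arbitrary memory was read and written, so it cannot cache sharedFlag in a
// register across the barrier or reorder memory operations around it.
inline void compilerBarrier() {
  __asm__ volatile("" ::: "memory");
}

int sharedFlag = 0;

int readFlagTwice() {
  int a = sharedFlag;
  compilerBarrier(); // forces a genuine second load of sharedFlag
  int b = sharedFlag;
  return a + b;
}
```

This constrains only the compiler and emits no hardware fence, which is why the thread distinguishes compiler-level reordering from whatever ptxas does downstream.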

Harbormaster completed remote builds in B39089: Diff 223609.

No, not in LLVM. LLVM behaves correctly; it is ptxas from CUDA8 that does some incorrect optimizations.

We don't need to fix anything in LLVM, we just need to prevent some optimizations in ptxas. PTX inline code does exactly what we need here.

jlebar added a comment. Oct 8 2019, 6:58 AM

Hi, jumping in here. I haven't read the whole bug, apologies.

ptxas from CUDA 8 has *many* known bugs. I would also strongly oppose attempting to work around them in LLVM.

If you cannot upgrade from CUDA 8, you can still take a newer ptxas binary and use it in combination with the rest of CUDA 8. We have done this at Google for years now with no problems, and with a blessing from nvidia.

If you encounter bugs in the latest ptxas and can provide a reproducer, we can file bugs against nvidia if you cannot. (Well, I guess I'm volunteering tra, I don't work on this anymore. :) I'm not opposed to checking in workarounds for bugs in *the latest* ptxas if we have a process to remove these workarounds soon after a newer ptxas is available (i.e. we don't say, "remove after three ptxas releases" or something). I would strongly oppose keeping workarounds for old ptxas versions because that would greatly complicate the NVPTX backend and have little benefit.

Then we're finally on the same page. Thanks for your patience.

No, I'm not trying to add a workaround for the bug in LLVM, since there are no problems in LLVM optimizations. The same code works correctly with Cuda9.2. But I would like to add it for cuda8, since we need to fully support it unless we drop support for cuda8. It means that, unfortunately, we need to support cuda8 in full, with all its bugs and problems.
I'm not happy to add it, but we just need to do this. Later, when we drop support for Cuda8, we can remove this workaround and just use regular code.

ABataev updated this revision to Diff 223845. Oct 8 2019, 7:10 AM

Fixed title and description, rebase.

jlebar requested changes to this revision. Oct 8 2019, 7:35 AM

Based on this and my comment above, I strongly oppose this patch.

ptxas from CUDA 8 has many, many known bugs. I oppose working around them in LLVM; it will eat up a huge amount of maintainers' time, and you will never cover all or even many of the bugs.

If you want to un-support CUDA 8 in LLVM, fine by me. If in your own personal setup you want to use the ptxas from CUDA 10.1 with the rest of the CUDA 8 toolkit, that should also work.

This revision now requires changes to proceed. Oct 8 2019, 7:35 AM

Fine, then no more optimizations in the runtime unless we drop support for Cuda8.

ABataev abandoned this revision. Oct 8 2019, 7:50 AM

Ideologically I'm right there with you. Killing off cuda 8 support works for me - there's presumably an official deprecation process we should follow. Announcements and so forth.

Pragmatically, you're overlooking a couple of things. We don't need to cover many of the bugs to keep the deviceRTL working - it's 6k lines of relatively straightforward cuda that will exercise a small subset of the nvptx toolchain.

More fundamentally, it may be necessary to keep cuda 8 working despite the broken toolchain. People write code against buggy compilers, and when the compiler is changed, their code behaves differently. The use case is wanting to use a recent llvm, e.g. for x86 related security patches, while also continuing to use a buggy and outdated cuda toolchain so their numerical codes keep getting the same answers.

So if we can drop cuda 8, great. If some of the maintainers have a customer that requires cuda 8 to continue working with a recent llvm - which is sad but plausible - we're better off supporting that use case in tree than forcing an out of tree fork.

jlebar added a comment. Oct 8 2019, 8:51 AM

This one revision, which attempts to work around one bug out of many in CUDA 8's ptxas, is months old and has about a hundred comments from many maintainers. The cost of going down this road is huge and imposed on all backend maintainers. You can see that here in this review, where tra got pulled in repeatedly, despite not having any direct interest in openmp.

As a primary maintainer of the NVPTX backend I would indeed prefer that someone fork LLVM than ask me for assistance supporting this old, buggy ptxas.

I have been down this road, it was my life for three years. Ignore my cries at your own peril, etc etc.

My objection isn't about deviceRTL specifically.

If those working on deviceRTL/openmp promise never to involve NVPTX backend maintainers in any discussions involving changes to deviceRTL/openmp source code to work around bugs in old ptxas, and promise never to suggest changes to clang or LLVM proper to work around such bugs, then you have my blessing to do whatever you'd like in your source code, just like any non-LLVM project which uses clang/llvm can make whatever changes to their code to work around whatever bugs.

In practice this would probably mean we'd need to set up email filters to send anything containing "openmp" to /dev/null. But I think that's not the kind of community we want to build? I want to partner with you all, and I think you all benefit from partnering with us. Thus the need for whole-community standards as far as what we do and don't support.

jlebar added a comment. Edited Oct 8 2019, 11:57 AM

I spoke offline with tra@ about this; he was confused why I was coming down so hard here.

For lack of a better word, I think I have some PTSD from customers being hit by ptxas bugs and then expecting me to fix or work around them. I know that even very simple CUDA code can hit bugs in old ptxas. There are even data-dependent bugs which only manifest with certain pointer values. This makes it difficult to have confidence that any particular function works properly, even if you've tested it thoroughly. However awful you think it is, it's worse.

We've been working on this for a long time and have found that the only thing which consistently works is using the newest ptxas. Upgrading is painful, but using a buggy optimizing assembler is worse.

I think the root reason I came down hard here is because I was/am afraid that this change would be used as justification for trying to work around even a subset of these bugs in LLVM itself. That would be a Sisyphean task and so I don't want us to sign up for it.

If you all are willing to say that this patch won't be used as precedent to argue in favor of making changes to clang or LLVM for compatibility with ptxas 8.0, then I withdraw my objections here.