
[OPENMP][NVPTX] Mark parallel level counter as volatile.
Abandoned · Public

Authored by ABataev on May 24 2019, 7:47 AM.

Details

Summary

The parallel level counter is accessed simultaneously by many threads. In combination with atomic operations used in full runtime SPMD mode (fired by dynamic scheduling, for example) it may lead to incorrect results caused by compiler optimizations. According to the CUDA Toolkit documentation (https://docs.nvidia.com/cuda/archive/8.0/cuda-c-programming-guide/index.html#volatile-qualifier):

The compiler is free to optimize reads and writes to global or shared memory (for example, by caching global reads into registers or L1 cache) as long as it respects the memory ordering semantics of memory fence functions and memory visibility semantics of synchronization functions.

These optimizations can be disabled using the volatile keyword: If a variable located in global or shared memory is declared as volatile, the compiler assumes that its value can be changed or used at any time by another thread and therefore any reference to this variable compiles to an actual memory read or write instruction.

This is especially important in the case of thread divergence mixed with atomic operations. In our case, this may lead to undefined behavior or thread deadlock, especially on CUDA 9 and later. This change is required especially for SPMD mode with full runtime, or for SPMD mode without full runtime but with atomic operations.

Event Timeline

tra added a comment. Jun 13 2019, 10:06 AM
In D62393#1541969, @tra wrote:

@reames , @tra , @Hahnfeld , @jlebar , @chandlerc, I see that this was discussed in D50391 (in terms of using PTX's volatile to provide atomic lowering), but it's not clear to me that using volatile with atomic semantics in LLVM is something we guarantee will work (even in CUDA mode). I imagine that we'd need to lower these in terms of relaxed atomics in Clang for this to really be guaranteed to work correctly.

Do I understand it correctly that the argument is about using C++ volatile with the assumption that those will map to ld.volatile/st.volatile, which happen to provide sufficiently strong guarantees to be the equivalent of LLVM's atomic monotonic operations?

If that's the case, then I would agree that it's an implementation detail one should not be relying on. If atomicity is needed, we should figure out a way to express that in C++.

In practice, the standard C++ library support in CUDA is somewhat limited. I don't think atomic<> works, so volatile might be a 'happens to work' short-term workaround. If that's the case, then there should be a comment describing what's going on here and a TODO to fix it when better support for atomics on GPU is available.

Another option would be to use atomic*() functions provided by CUDA, but those do have limited functionality on older GPUs.

Yet another alternative is to explicitly use ld.volatile/st.volatile from inline assembly. I don't think we have it plumbed as a clang builtin.

Artem, thanks for your comment. But we need not only atomicity; we also need to tell the ptxas compiler not to optimize accesses to parallelLevel.

So you need to have ld.volatile/st.volatile to make sure the compiler does not mess with accesses to shared memory.
C++ volatile will give you that. You also rely on atomicity. C++ volatile does not guarantee that, even if NVPTX does happen to. It's a mere coincidence. What if the next PTX revision provides a better way to implement C++ volatile without providing atomicity? Then your code will all of a sudden break in new and exciting ways.

I guess the conservative approach here would be to declare the variable volatile, and then use inline asm ld.volatile/st.volatile to increment it. This will guarantee that the code does exactly what you want without assuming that LLVM back-end always lowers volatile to ld/st.volatile.
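The inline-assembly route suggested here might look roughly like the following sketch (the wrapper names are illustrative, not the actual runtime code, and the asm constraints assume 64-bit pointers):

```cuda
#include <cstdint>

// Sketch: explicit ld.volatile/st.volatile wrappers via inline PTX, so the
// semantics do not depend on how the LLVM back-end happens to lower C++
// volatile. Names are illustrative, not actual libomptarget identifiers.
__device__ static uint32_t loadVolatile(const uint32_t *Ptr) {
  uint32_t Val;
  asm volatile("ld.volatile.u32 %0, [%1];"
               : "=r"(Val)
               : "l"(Ptr)
               : "memory");
  return Val;
}

__device__ static void storeVolatile(uint32_t *Ptr, uint32_t Val) {
  asm volatile("st.volatile.u32 [%0], %1;"
               :
               : "l"(Ptr), "r"(Val)
               : "memory");
}
```

The "memory" clobber keeps the compiler from caching other memory across the access; the .volatile qualifier keeps ptxas from eliding or merging the access itself.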

According to several comments (like this one: https://stackoverflow.com/a/1533115), it is recommended to use the volatile modifier in CUDA in such cases. If clang does not provide the required level of support for CUDA volatile, I think this is an incompatibility with nvcc.

Wrong SO link? The question does not seem to be CUDA-related.

As for the advice to use volatile with nvcc, in my experience it tends to be overused a lot, and it's quite often wrong. We used to have *tons* of that in CUDA code in TensorFlow, and most of those cases turned out to be a big crowbar shoved in the way of compiler optimizations in order to cover up a real issue somewhere else.

Also, I have already thought about using PTX instructions directly. Probably you're right that it would be better to use them if you're sure that there is a difference between nvcc and clang.

AFAICT, clang will produce PTX .volatile for C++ volatile variables. That's the closest we can get to implementing C++ semantics on GPU. The fact that it may end up providing stronger guarantees should not be relied on, IMO. It will not be obvious to whoever is going to read the code later. And the LLVM back-end is not guaranteed to always lower C++ volatile as PTX .volatile.

No, I'm not relying on the non-optimization of atomics. I need both volatile semantics and atomicity, so that the compiler cannot optimize out memory accesses and each access is atomic.

I'm going to try to improve the precision on that statement, if you'll allow me.

Since the address at &parallelLevel is necessarily backed by real memory (I mean RAM, and not PCI registers) because it is an instance introduced by C++ code, then many optimizations are not observable regardless of whatever decorations are applied to the declaration. For example x = *ptr; y = *ptr; can always be simplified to x = *ptr; y = x; when you know that ptr points to memory (and some other conditions, like the set of all y = ... simplified is finite). There is a long list of such simplifications based on "run in finite isolation" substitutions (*ptr = x; y = *ptr; => *ptr = x; y = x;... etc...).

To be clear: there is nothing you can do to prevent ptxas from performing optimizations such as these, even on accesses marked volatile, so the value of the predicate "could not optimize memory accesses" is just hardcoded to false, full stop.

The specific optimizations you don't want, are the optimizations that aren't valid on std::atomic<T> with memory_order_relaxed, for example performing substitutions across fences, barriers, and in potentially infinite loops. And again, the final conclusion is completely valid, you disable those optimizations with *.relaxed.sys or *.volatile in PTX. You just don't disable all optimizations.

In the end the mental model is simple: think as though you had std::atomic<T> available to you, the conclusions you draw and the code you write with that will be correct.

In D62393#1542042, @tra wrote:
In D62393#1541969, @tra wrote:

@reames , @tra , @Hahnfeld , @jlebar , @chandlerc, I see that this was discussed in D50391 (in terms of using PTX's volatile to provide atomic lowering), but it's not clear to me that using volatile with atomic semantics in LLVM is something we guarantee will work (even in CUDA mode). I imagine that we'd need to lower these in terms of relaxed atomics in Clang for this to really be guaranteed to work correctly.

Do I understand it correctly that the argument is about using C++ volatile with the assumption that those will map to ld.volatile/st.volatile, which happen to provide sufficiently strong guarantees to be the equivalent of LLVM's atomic monotonic operations?

If that's the case, then I would agree that it's an implementation detail one should not be relying on. If atomicity is needed, we should figure out a way to express that in C++.

In practice, the standard C++ library support in CUDA is somewhat limited. I don't think atomic<> works, so volatile might be a 'happens to work' short-term workaround. If that's the case, then there should be a comment describing what's going on here and a TODO to fix it when better support for atomics on GPU is available.

Another option would be to use atomic*() functions provided by CUDA, but those do have limited functionality on older GPUs.

Yet another alternative is to explicitly use ld.volatile/st.volatile from inline assembly. I don't think we have it plumbed as a clang builtin.

Artem, thanks for your comment. But we need not only atomicity; we also need to tell the ptxas compiler not to optimize accesses to parallelLevel.

So you need to have ld.volatile/st.volatile to make sure the compiler does not mess with accesses to shared memory.
C++ volatile will give you that. You also rely on atomicity. C++ volatile does not guarantee that, even if NVPTX does happen to. It's a mere coincidence. What if the next PTX revision provides a better way to implement C++ volatile without providing atomicity? Then your code will all of a sudden break in new and exciting ways.

I guess the conservative approach here would be to declare the variable volatile, and then use inline asm ld.volatile/st.volatile to increment it. This will guarantee that the code does exactly what you want without assuming that LLVM back-end always lowers volatile to ld/st.volatile.

Yes, this is what I'm going to do.

According to several comments (like this one: https://stackoverflow.com/a/1533115), it is recommended to use the volatile modifier in CUDA in such cases. If clang does not provide the required level of support for CUDA volatile, I think this is an incompatibility with nvcc.

Wrong SO link? The question does not seem to be CUDA-related.

As for the advice to use volatile with nvcc, in my experience it tends to be overused a lot, and it's quite often wrong. We used to have *tons* of that in CUDA code in TensorFlow, and most of those cases turned out to be a big crowbar shoved in the way of compiler optimizations in order to cover up a real issue somewhere else.

I know; I have also fixed a couple of bugs here that were masked by volatile, but not in this particular case.

Also, I have already thought about using PTX instructions directly. Probably you're right that it would be better to use them if you're sure that there is a difference between nvcc and clang.

AFAICT, clang will produce PTX .volatile for C++ volatile variables. That's the closest we can get to implementing C++ semantics on GPU. The fact that it may end up providing stronger guarantees should not be relied on, IMO. It will not be obvious to whoever is going to read the code later. And the LLVM back-end is not guaranteed to always lower C++ volatile as PTX .volatile.

Yes, I checked this and relied on the fact that clang generates .volatile kinds of ops.
I know it is not guaranteed. Anyway, we could use conditional compilation in case of any changes in future versions. But PTX asm instructions are the best option here, I think.

In D62393#1542042, @tra wrote:

C++ volatile will give you that. You also rely on atomicity. C++ volatile does not guarantee that, even if NVPTX does happen to. It's a mere coincidence. What if next PTX revision provides a better way to implement C++ volatile without providing atomicity? Then your code will all of a sudden break in new and exciting ways.

I think it's completely healthy and fair to be skeptical that we'll do the right thing by you, but FWIW the changes I want to make to CUDA volatile after we ship std::atomic<T> go in the other direction. And the legacy requirements are pretty entrenched here, too.

No, I'm not relying on the non-optimization of atomics. I need both volatile semantics and atomicity, so that the compiler cannot optimize out memory accesses and each access is atomic.

I'm going to try to improve the precision on that statement, if you'll allow me.

Since the address at &parallelLevel is necessarily backed by real memory (I mean RAM, and not PCI registers) because it is an instance introduced by C++ code, then many optimizations are not observable regardless of whatever decorations are applied to the declaration. For example x = *ptr; y = *ptr; can always be simplified to x = *ptr; y = x; when you know that ptr points to memory (and some other conditions, like the set of all y = ... simplified is finite). There is a long list of such simplifications based on "run in finite isolation" substitutions (*ptr = x; y = *ptr; => *ptr = x; y = x;... etc...).

To be clear: there is nothing you can do to prevent ptxas from performing optimizations such as these, even on accesses marked volatile, so the value of the predicate "could not optimize memory accesses" is just hardcoded to false, full stop.

The specific optimizations you don't want, are the optimizations that aren't valid on std::atomic<T> with memory_order_relaxed, for example performing substitutions across fences, barriers, and in potentially infinite loops. And again, the final conclusion is completely valid, you disable those optimizations with *.relaxed.sys or *.volatile in PTX. You just don't disable all optimizations.

Yes, this is what I need and what I'm trying to explain. Thanks, Olivier, you did it better than me.

In the end the mental model is simple: think as though you had std::atomic<T> available to you, the conclusions you draw and the code you write with that will be correct.

This is a clarification of some older comments.

First of all, you started this when you expressed the thought that the patches are accepted not because they are correct, but because we all work at the same company. This was rude! Besides, we don't work at the same company anymore.

I do not remember saying this, especially since I know you don't work at the same company.

Second, you need to try to understand the main idea yourself. I can't explain the whole protocol between the compiler and the runtime; it would take a lot of time. Try to understand this first, and then ask questions about particular details, not about the whole scheme. Or do you want me to ask you to explain the principles of the Attributor in full? Or the loop vectorizer? Or something else? Your question is not productive.

That is exactly what happened (wrt. the Attributor): people looked at the code and asked about principles, requested documentation, etc. If you look at the code now, it is all the better for it. So far you have just ignored my requests for clarifications and justifications, which is what this whole code review actually started with.

Third, the single array is used to handle up to 128 (not 256) inner parallel regions. This is fine from the point of view of the standard. The number of supported inner parallel levels (both active and inactive) is implementation-defined.

I asked where we document that a single array encodes both the number of active and the number of inactive parallel regions at the same time. The code is not sufficient documentation for such a low-level implementation detail. Similar, though previously not on my radar, is the fact that there is an apparently undocumented implementation detail wrt the number of levels we can handle.

Hal, I agree that this was a very dangerous situation, but I think you need to prove something before trying to escalate the situation and blame somebody for doing incorrect things. Johannes did this without any proof; he directly blamed me and others for doing improper things, though he has no idea at all how things work in CUDA.

I did already provide plenty of arguments in https://reviews.llvm.org/D62199#1515182 and https://reviews.llvm.org/D62199#1517073.

For the record, this code review was (from my side) never about accusations or improper behavior but about:

  1. The right solution to a problem I don't think we specified(*) properly, and
  2. The general lack of documentation that led to various questions on how to interpret the changes.

    (*) My questions about which accesses are racy have been ignored, as have my inquiries about an alternative explicit synchronization to communicate the value through all threads in the warp.

    I don't think I have "no idea at all how things work in CUDA" and I don't appreciate the accusation.

Yes, but you need to have at least some basic knowledge about OpenMP itself, CUDA, and the OpenMP runtime. I have no time to write lectures trying to explain basic things to Johannes.

Me, and probably other people, might argue I have some OpenMP/Clang experience.

Yes, you're right. The design decisions must be described somewhere. But I don't think that this must prevent work on runtime improvements.

This seems to me like a core issue. Improvements without documentation will inevitably cause problems down the road, and rushing into a solution is also not a sustainable practice.

Arguably, atomic accesses provide atomicity. You also never responded to the question of which other accesses actually race with the ones to parallelLevel, or why you do not synchronize the accesses to parallelLevel across the warp explicitly.

I answered all the questions. We can use explicit synchronization, yes, but it will slow down the whole program execution. Instead, it is more effective to synchronize access to the single variable. The volatile modifier allows us to do this without explicit atomic operations. In CUDA, volatile is slightly different from that in C/C++ and can be used to implement relaxed atomic operations.
In this array, the memory is written by one thread but can be read by other threads in the same warp. Without the volatile modifier or atomic ops, the ptxas tool reorders operations, and this leads to undefined behaviour when the runtime is inlined.

According to D50391, explicit relaxed accesses are lowered into volatile PTX accesses. Why did you see a slowdown or did you simply expect a slowdown?

If somebody tries to review the code, the review must be productive. If the review leads to a set of lectures about language details, it is not productive; those are school lectures. Should I explain the CUDA language in the review? No. To review the code you need to have at least some minimal level of expertise.

I never asked for a school lecture, nor is it appropriate that you think you have to give me one instead of actually answering my questions and commenting on my concerns. Simply questioning my expertise and thereby ignoring my concerns is not helping anyone.

Hal, @hfinkel, do you really want me to continue with all this stuff?

Yes, there are outstanding technical questions; from @jdoerfert :

I'm also still not following what race condition you try to prevent with volatile in the first place, could you elaborate which accesses are racing without it?

and a related question regarding alternate synchronization schemes (my suspicion is that it's more useful to answer the question about which accesses are racing first, so that the reviewers can assess what ordering semantics are required, and then address the benefits or drawbacks of alternate synchronization schemes).

@jdoerfert also asked:

I asked where we document that a single array encodes both the number of active and the number of inactive parallel regions at the same time. The code is not sufficient documentation for such a low-level implementation detail. Similar, though previously not on my radar, is the fact that there is an apparently undocumented implementation detail wrt the number of levels we can handle.

And either the requested documentation should be added, or if it already exists, you should point that out.

Your reply to @Hahnfeld:

... 1 is returned if at least 1 parallel region is active. And we count both active and inactive regions. We may have only 1 active region, the very first one. If the first region is active, the MSB is set to 1 to mark this situation.

this should also appear in a comment in the code.

In general, please treat all reviewer comments as comments from experts that should be addressed (and, generally, code reviewers do have relatively-rare expertise in particular areas). Experts are also busy, however, so we all try to be respectful of each other's time. Experts also make mistakes and suboptimal decisions, of course, which is why we have code review. Thus, if someone asks for an explanation of how something works in the current or updated code, then provide an explanation. If someone requests additional documentation, then add it. Yes, the expert can read the code, but it's more efficient for you to provide the explanation (as the author of the patch, you have presumably read the code most recently) or documentation.

Again, I see no reason to believe that everyone here isn't acting in good faith and working to create the software of the highest possible quality. Thanks!

tra added a comment. Jun 13 2019, 10:42 AM
In D62393#1542042, @tra wrote:

C++ volatile will give you that. You also rely on atomicity. C++ volatile does not guarantee that, even if NVPTX does happen to. It's a mere coincidence. What if next PTX revision provides a better way to implement C++ volatile without providing atomicity? Then your code will all of a sudden break in new and exciting ways.

I think it's completely healthy and fair to be skeptical that we'll do the right thing by you, but FWIW the changes I want to make to CUDA volatile after we ship std::atomic<T> go in the other direction. And the legacy requirements are pretty entrenched here, too.

I didn't mean this to be a critical statement. I apologize if it came out as such. The example was purely hypothetical to illustrate the point that one should not assume that a C++ qualifier will always do what C++ standard does not guarantee.

This is a clarification of some older comments.

First of all, you started this when you expressed the thought that the patches are accepted not because they are correct, but because we all work at the same company. This was rude! Besides, we don't work at the same company anymore.

I do not remember saying this, especially since I know you don't work at the same company.

Second, you need to try to understand the main idea yourself. I can't explain the whole protocol between the compiler and the runtime; it would take a lot of time. Try to understand this first, and then ask questions about particular details, not about the whole scheme. Or do you want me to ask you to explain the principles of the Attributor in full? Or the loop vectorizer? Or something else? Your question is not productive.

That is exactly what happened (wrt. the Attributor): people looked at the code and asked about principles, requested documentation, etc. If you look at the code now, it is all the better for it. So far you have just ignored my requests for clarifications and justifications, which is what this whole code review actually started with.

Third, the single array is used to handle up to 128 (not 256) inner parallel regions. This is fine from the point of view of the standard. The number of supported inner parallel levels (both active and inactive) is implementation-defined.

I asked where we document that a single array encodes both the number of active and the number of inactive parallel regions at the same time. The code is not sufficient documentation for such a low-level implementation detail. Similar, though previously not on my radar, is the fact that there is an apparently undocumented implementation detail wrt the number of levels we can handle.

Hal, I agree that this was a very dangerous situation, but I think you need to prove something before trying to escalate the situation and blame somebody for doing incorrect things. Johannes did this without any proof; he directly blamed me and others for doing improper things, though he has no idea at all how things work in CUDA.

I did already provide plenty of arguments in https://reviews.llvm.org/D62199#1515182 and https://reviews.llvm.org/D62199#1517073.

For the record, this code review was (from my side) never about accusations or improper behavior but about:

  1. The right solution to a problem I don't think we specified(*) properly, and
  2. The general lack of documentation that led to various questions on how to interpret the changes.

    (*) My questions about which accesses are racy have been ignored, as have my inquiries about an alternative explicit synchronization to communicate the value through all threads in the warp.

    I don't think I have "no idea at all how things work in CUDA" and I don't appreciate the accusation.

Yes, but you need to have at least some basic knowledge about OpenMP itself, CUDA, and the OpenMP runtime. I have no time to write lectures trying to explain basic things to Johannes.

Me, and probably other people, might argue I have some OpenMP/Clang experience.

Yes, you're right. The design decisions must be described somewhere. But I don't think that this must prevent work on runtime improvements.

This seems to me like a core issue. Improvements without documentation will inevitably cause problems down the road, and rushing into a solution is also not a sustainable practice.

Arguably, atomic accesses provide atomicity. You also never responded to the question of which other accesses actually race with the ones to parallelLevel, or why you do not synchronize the accesses to parallelLevel across the warp explicitly.

I answered all the questions. We can use explicit synchronization, yes, but it will slow down the whole program execution. Instead, it is more effective to synchronize access to the single variable. The volatile modifier allows us to do this without explicit atomic operations. In CUDA, volatile is slightly different from that in C/C++ and can be used to implement relaxed atomic operations.
In this array, the memory is written by one thread but can be read by other threads in the same warp. Without the volatile modifier or atomic ops, the ptxas tool reorders operations, and this leads to undefined behaviour when the runtime is inlined.

According to D50391, explicit relaxed accesses are lowered into volatile PTX accesses. Why did you see a slowdown or did you simply expect a slowdown?

If somebody tries to review the code, the review must be productive. If the review leads to a set of lectures about language details, it is not productive; those are school lectures. Should I explain the CUDA language in the review? No. To review the code you need to have at least some minimal level of expertise.

I never asked for a school lecture, nor is it appropriate that you think you have to give me one instead of actually answering my questions and commenting on my concerns. Simply questioning my expertise and thereby ignoring my concerns is not helping anyone.

Hal, @hfinkel, do you really want me to continue with all this stuff?

Yes, there are outstanding technical questions; from @jdoerfert :

I'm also still not following what race condition you try to prevent with volatile in the first place, could you elaborate which accesses are racing without it?

Several threads in the same warp access the same parallelLevel array element. The compiler optimizes out some of the accesses (because it is not aware of the access from several threads), and this leads to undefined behaviour.

and a related question regarding alternate synchronization schemes (my suspicion is that it's more useful to answer the question about which accesses are racing first, so that the reviewers can assess what ordering semantics are required, and then address the benefits or drawbacks of alternate synchronization schemes).

@jdoerfert also asked:

I asked where we document that a single array encodes both the number of active and the number of inactive parallel regions at the same time. The code is not sufficient documentation for such a low-level implementation detail. Similar, though previously not on my radar, is the fact that there is an apparently undocumented implementation detail wrt the number of levels we can handle.

And either the requested documentation should be added, or if it already exists, you should point that out.

I agree that the documentation should be added; I already said this. But this is not directly related to this patch.

Your reply to @Hahnfeld:

... 1 is returned if at least 1 parallel region is active. And we count both active and inactive regions. We may have only 1 active region, the very first one. If the first region is active, the MSB is set to 1 to mark this situation.

this should also appear in a comment in the code.

In general, please treat all reviewer comments as comments from experts that should be addressed (and, generally, code reviewers do have relatively-rare expertise in particular areas). Experts are also busy, however, so we all try to be respectful of each other's time. Experts also make mistakes and suboptimal decisions, of course, which is why we have code review. Thus, if someone asks for an explanation of how something works in the current or updated code, then provide an explanation. If someone requests additional documentation, then add it. Yes, the expert can read the code, but it's more efficient for you to provide the explanation (as the author of the patch, you have presumably read the code most recently) or documentation.

Yes, I agree, and I can answer the technical questions, if they are productive and lead to the best code quality.

Again, I see no reason to believe that everyone here isn't acting in good faith and working to create the software of the highest possible quality. Thanks!

Sorry, but Johannes did this. "The more patches go through with this kind of "review" the more "suspicious" this looks to me." Those are his words. It seems to me he does not think that I or my colleagues are acting in good faith.

reames removed a subscriber: reames. Jun 13 2019, 11:33 AM

...

Again, I see no reason to believe that everyone here isn't acting in good faith and working to create the software of the highest possible quality. Thanks!

Sorry, but Johannes did this. "The more patches go through with this kind of "review" the more "suspicious" this looks to me." Those are his words. It seems to me he does not think that I or my colleagues are acting in good faith.

Okay, I understand why you say this, but no. Frankly, Johannes was implying that the reviews were too lax (e.g., allowing combined patches with insufficient tests/documentation/etc.). I think that all of these issues have now been covered here (and on D62199, etc.), and so I won't repeat them again. The reviewers (e.g., @grokos for the case of D62199) almost certainly believed that they were being helpful and moving the development process forward. The fact that the patches did not all follow our best practices, etc. makes Johannes "suspicious" that the patches could have reasonably been improved before being committed. It makes me suspicious of that too. But no one here believes that anyone was trying to subvert the system and produce an inferior result - it is very likely that everyone had, and continues to have, the best of intentions. And, I'll point out that we have indeed accomplished a lot in terms of OpenMP feature enablement, etc. At this point, we have more eyes on the process and we should all work together to produce an even better result. The glass, in this case, really is half full.

I want to investigate the racy accesses further and make sure it is not a miscompile inside LLVM.
I extracted the test case (see below) but I was not seeing the ERROR. How did you run the test case to see a different value for Count?

@ye-luo ran it with Clang-ykt (2019-05-27) on a Quadro P1000 and I ran Clang (Jun 13 15:24:11 2019 : 3bc6e2a7aa3853b06045c42e81af094647c48676) on a V100 so far.

#include <stdio.h>

int main() {
  for (int i = 0; i < 1000; ++i) {
    int Count = 0;
#pragma omp target parallel for reduction(+: Count) schedule(dynamic, 2) num_threads(64)
    for (int J = 0; J < 1000; ++J) {
      Count += J;
    }
    if (Count != 499500)
      printf("ERROR [@%i] %i\n", i, Count);
  }

  // Final result of Count is 1000 * (999-0) / 2
  // CHECK: Expected count with dynamic scheduling = 499500
  // printf("Expected count with dynamic scheduling = %d\n", Count);
}

...

Again, I see no reason to believe that everyone here isn't acting in good faith and working to create the software of the highest possible quality. Thanks!

Sorry, but Johannes did this. "The more patches go through with this kind of "review" the more "suspicious" this looks to me." Those are his words. It seems to me that he does not think that I or my colleagues are acting in good faith.

Okay, I understand why you say this, but no. Frankly, Johannes was implying that the reviews were too lax (e.g., allowing combined patches with insufficient tests/documentation/etc.).

Only one patch was combined. Tests are added to all the functional patches, as soon as I discovered that the testing infrastructure for libomptarget was already configured. And I myself insist on adding test cases for functional patches; you can ask Doru about this.
As for the documentation, yes, we lack documentation for libomptarget. This is a long story and it requires dedicated work. I'm also not very happy with the current situation. But the documentation is separate work. It requires a special set of patches with the design decisions, etc. Most of it follows the design of the original libomp, but has some NVPTX-specific parts.

I think that all of these issues have now been covered here (and on D62199, etc.), and so I won't repeat them again. The reviewers (e.g., @grokos for the case of D62199) almost certainly believed that they were being helpful and moving the development process forward.

The fact that the patches did not all follow our best practices, etc. makes Johannes "suspicious" that the patches could have reasonably been improved before being committed.

Which ones do not follow "best practices"? The combined one? That is only one patch. The NFC patches? They are refactoring patches and we already have tests for them. He did not try to understand the idea of the patches; instead, he decided to treat them as "suspicious" without any comments.

It makes me suspicious of that too. But no one here believes that anyone was trying to subvert the system and produce an inferior result - it is very likely that everyone had, and continues to have, the best of intentions.

Maybe, just maybe, before starting to treat someone's activity as "suspicious", it would be better to try to understand it first? To read something, to ask questions in the proper manner, etc.?

And, I'll point out that we have indeed accomplished a lot in terms of OpenMP feature enablement, etc. At this point, we have more eyes on the process and we should all work together to produce an even better result. The glass, in this case, really is half full.

I want to investigate the racy accesses further and make sure it is not a miscompile inside LLVM.

This is not a problem inside LLVM. The problem appears after optimizations performed by the ptxas tool (when it compiles PTX to SASS) at O3 with the inlined runtime.

I extracted the test case (see below) but I was not seeing the ERROR. How did you run the test case to see a different value for Count?

You need to compile it with the inlined runtime at O2 or O3. And you need the latest version of libomptarget; the clang-ykt version is implemented differently and does not have this problem.

@ye-luo ran it with Clang-ykt (2019-05-27) on a Quadro P1000 and I ran Clang (Jun 13 15:24:11 2019 : 3bc6e2a7aa3853b06045c42e81af094647c48676) on a V100 so far.

#include <stdio.h>

int main() {

  for (int i = 0; i < 1000; ++i) {
    int Count = 0;
#pragma omp target parallel for reduction(+: Count) schedule(dynamic, 2) num_threads(64)
    for (int J = 0; J < 1000; ++J) {
      Count += J;
    }
    if (Count != 499500)
      printf("ERROR [@%i] %i\n", i, Count);
  }

  // Final result of Count is 1000 * (999-0) / 2
  // CHECK: Expected count with dynamic scheduling = 499500
  // printf("Expected count with dynamic scheduling = %d\n", Count);
}

It makes me suspicious of that too. But no one here believes that anyone was trying to subvert the system and produce an inferior result - it is very likely that everyone had, and continues to have, the best of intentions.

Maybe, just maybe, before starting to treat someone's activity as "suspicious", it would be better to try to understand it first? To read something, to ask questions in the proper manner, etc.?

I did ask questions [0,1] and read *a lot* [2,3].

Here, and in D62199, my comments resulted in:

  • Complex technical discussions with commitments towards improving the patches and review process ([4] to name one), and
  • Various comments and accusations that people found inappropriate ([1,5-11] to restrict it to these patches only). The one you refer to is in [1] and in my follow up [12] I agreed that my words were not well chosen.

[0] https://reviews.llvm.org/D62199#1512617
[1] https://reviews.llvm.org/D62393#1528421
[2] https://reviews.llvm.org/D62199#1515182
[3] https://reviews.llvm.org/D62199#1517073

[4] https://reviews.llvm.org/D62199#1515072

[5] https://reviews.llvm.org/D62199#1512638
[6] https://reviews.llvm.org/D62199#1513126
[7] https://reviews.llvm.org/D62199#1514027
[8] https://reviews.llvm.org/D62393#1533461
[9] https://reviews.llvm.org/D62393#1539086
[10] https://reviews.llvm.org/D62393#1539630
[11] https://reviews.llvm.org/D62393#1542505

[12] https://reviews.llvm.org/D62199#1515182

It makes me suspicious of that too. But no one here believes that anyone was trying to subvert the system and produce an inferior result - it is very likely that everyone had, and continues to have, the best of intentions.

Maybe, just maybe, before starting to treat someone's activity as "suspicious", it would be better to try to understand it first? To read something, to ask questions in the proper manner, etc.?

I did ask questions [0,1] and read *a lot* [2,3].

Here, and in D62199, my comments resulted in:

[0] was enough for me to interpret all your subsequent actions in a certain light. Your questions do not mean anything; you had already made the decision, and all the questions were just to confirm that first judgement.

...

Again, I see no reason to believe that everyone here isn't acting in good faith and working to create the software of the highest possible quality. Thanks!

Sorry, but Johannes did this. "The more patches go through with this kind of "review" the more "suspicious" this looks to me." Those are his words. It seems to me that he does not think that I or my colleagues are acting in good faith.

Okay, I understand why you say this, but no. Frankly, Johannes was implying that the reviews were too lax (e.g., allowing combined patches with insufficient tests/documentation/etc.).

Only one patch was combined.

I'm not sure that this is true. D55379 was also. There might be others. I don't think that it's useful to conduct an audit in this regard. The important point is that there were a number of issues over time, some have been addressed, and others we're addressing right now.

Tests are added to all the functional patches, as soon as I discovered that the testing infrastructure for libomptarget was already configured. And I myself insist on adding test cases for functional patches; you can ask Doru about this.

I'm aware of a lot of the history (and I recall being involved in the discussion in D60578 regarding the tests). Nevertheless, we've somehow ended up with a library without sufficient documentation, and robust code reviews should have prevented that. It's not useful to assign blame, but it is something of which we should all be aware so that we can actively improve the situation going forward.

As for the documentation, yes, we lack documentation for libomptarget. This is a long story and it requires dedicated work. I'm also not very happy with the current situation. But the documentation is separate work. It requires a special set of patches with the design decisions, etc. Most of it follows the design of the original libomp, but has some NVPTX-specific parts.

Yes, we should have documentation covering the design decisions and other details, but I don't believe that treating this as a special task is productive. When a subsystem lacks good documentation, then we should improve that as we make changes to the code - just because the documentation is poor is not a good reason for not adding comments in the code now. It is, moreover, an even stronger argument for adding comments in the code now. Please do add whatever documentation is possible and associated with this change now.

I think that all of these issues have now been covered here (and on D62199, etc.), and so I won't repeat them again. The reviewers (e.g., @grokos for the case of D62199) almost certainly believed that they were being helpful and moving the development process forward.

The fact that the patches did not all follow our best practices, etc. makes Johannes "suspicious" that the patches could have reasonably been improved before being committed.

Which ones do not follow "best practices"? The combined one? That is only one patch. The NFC patches? They are refactoring patches and we already have tests for them.

Again, it is not just one patch, and there have been a variety of issues. Some have already been addressed and others we're addressing here. The important point is to make sure that things improve over time, and that's exactly what we can do now.

He did not try to understand the idea of the patches; instead, he decided to treat them as "suspicious" without any comments.

This statement is unproductive, and likely false. Johannes has spent a lot of time looking at the past patches and the code. If he sees a systematic problem, then there probably is one.

It makes me suspicious of that too. But no one here believes that anyone was trying to subvert the system and produce an inferior result - it is very likely that everyone had, and continues to have, the best of intentions.

Maybe, just maybe, before starting to treat someone's activity as "suspicious", it would be better to try to understand it first? To read something, to ask questions in the proper manner, etc.?

Again, this is unproductive, and false. I agree that Johannes could have been more polite at times - frustration can be stated directly, and that's almost always more productive - but let's be fair: there's a significant corpus of text where you're rude, dismissive, abrasive, and so on. I can provide links if you'd like, but I think you know very well that it's true. That should stop. Nevertheless, you expect everyone else to assume that you're working in good faith, have fairly considered and addressed their feedback, and so on. I suggest that you provide Johannes, in this case, the same courtesy you expect others to provide to you.

And, I'll point out that we have indeed accomplished a lot in terms of OpenMP feature enablement, etc. At this point, we have more eyes on the process and we should all work together to produce an even better result. The glass, in this case, really is half full.

I want to investigate the racy accesses further and make sure it is not a miscompile inside LLVM.

This is not a problem inside LLVM. The problem appears after optimizations performed by the ptxas tool (when it compiles PTX to SASS) at O3 with the inlined runtime.

I extracted the test case (see below) but I was not seeing the ERROR. How did you run the test case to see a different value for Count?

You need to compile it with the inlined runtime at O2 or O3.

When I run
./bin/clang -fopenmp-targets=nvptx64-nvida-cuda -O3 -fopenmp --cuda-path=/soft/compilers/cuda/cuda-9.1.85 -Xopenmp-target -march=sm_70 -fopenmp=libomp test.c -o test.ll -emit-llvm -S
I get

https://gist.github.com/jdoerfert/4376a251d98171326d625f2fb67b5259

which shows the inlined and optimized libomptarget.

And you need the latest version of the libomptarget

My version is from today Jun 13 15:24:11 2019, git: 3bc6e2a7aa3853b06045c42e81af094647c48676

I want to investigate the racy accesses further and make sure it is not a miscompile inside LLVM.

This is not a problem inside LLVM. The problem appears after optimizations performed by the ptxas tool (when it compiles PTX to SASS) at O3 with the inlined runtime.

I extracted the test case (see below) but I was not seeing the ERROR. How did you run the test case to see a different value for Count?

You need to compile it with the inlined runtime at O2 or O3.

When I run
./bin/clang -fopenmp-targets=nvptx64-nvida-cuda -O3 -fopenmp --cuda-path=/soft/compilers/cuda/cuda-9.1.85 -Xopenmp-target -march=sm_70 -fopenmp=libomp test.c -o test.ll -emit-llvm -S
I get

https://gist.github.com/jdoerfert/4376a251d98171326d625f2fb67b5259

which shows the inlined and optimized libomptarget.

And you need the latest version of the libomptarget

My version is from today Jun 13 15:24:11 2019, git: 3bc6e2a7aa3853b06045c42e81af094647c48676

We have problems in Cuda 8, at least, for arch sm_35

...

Again, I see no reason to believe that everyone here isn't acting in good faith and working to create the software of the highest possible quality. Thanks!

Sorry, but Johannes did this. "The more patches go through with this kind of "review" the more "suspicious" this looks to me." Those are his words. It seems to me that he does not think that I or my colleagues are acting in good faith.

Okay, I understand why you say this, but no. Frankly, Johannes was implying that the reviews were too lax (e.g., allowing combined patches with insufficient tests/documentation/etc.).

Only one patch was combined.

I'm not sure that this is true. D55379 was also. There might be others. I don't think that it's useful to conduct an audit in this regard. The important point is that there were a number of issues over time, some have been addressed, and others we're addressing right now.

Ok, you found another one, committed half a year ago. Do you really think you can call that systematic and suspicious?

Tests are added to all the functional patches, as soon as I discovered that the testing infrastructure for libomptarget was already configured. And I myself insist on adding test cases for functional patches; you can ask Doru about this.

I'm aware of a lot of the history (and I recall being involved in the discussion in D60578 regarding the tests). Nevertheless, we've somehow ended up with a library without sufficient documentation, and robust code reviews should have prevented that. It's not useful to assign blame, but it is something of which we should all be aware so that we can actively improve the situation going forward.

I said already that we need documentation. But I can't do everything myself. Plus, I'm not the original developer of the library; I'm just doing my best trying to improve it. Maybe you or Johannes can try to help with this and prepare documentation? That would be a great help!

As for the documentation, yes, we lack documentation for libomptarget. This is a long story and it requires dedicated work. I'm also not very happy with the current situation. But the documentation is separate work. It requires a special set of patches with the design decisions, etc. Most of it follows the design of the original libomp, but has some NVPTX-specific parts.

Yes, we should have documentation covering the design decisions and other details, but I don't believe that treating this as a special task is productive. When a subsystem lacks good documentation, then we should improve that as we make changes to the code - just because the documentation is poor is not a good reason for not adding comments in the code now. It is, moreover, an even stronger argument for adding comments in the code now. Please do add whatever documentation is possible and associated with this change now.

We need a base from which to start. It is separate work to prepare a base document which could then be modified/upgraded with subsequent patches.

I think that all of these issues have now been covered here (and on D62199, etc.), and so I won't repeat them again. The reviewers (e.g., @grokos for the case of D62199) almost certainly believed that they were being helpful and moving the development process forward.

The fact that the patches did not all follow our best practices, etc. makes Johannes "suspicious" that the patches could have reasonably been improved before being committed.

Which ones do not follow "best practices"? The combined one? That is only one patch. The NFC patches? They are refactoring patches and we already have tests for them.

Again, it is not just one patch, and there have been a variety of issues. Some have already been addressed and others we're addressing here. The important point is to make sure that things improve over time, and that's exactly what we can do now.

He did not try to understand the idea of the patches; instead, he decided to treat them as "suspicious" without any comments.

This statement is unproductive, and likely false. Johannes has spent a lot of time looking at the past patches and the code. If he sees a systematic problem, then there probably is one.

Let me disagree with you.

It makes me suspicious of that too. But no one here believes that anyone was trying to subvert the system and produce an inferior result - it is very likely that everyone had, and continues to have, the best of intentions.

Maybe, just maybe, before starting to treat someone's activity as "suspicious", it would be better to try to understand it first? To read something, to ask questions in the proper manner, etc.?

Again, this is unproductive, and false. I agree that Johannes could have been more polite at times - frustration can be stated directly, and that's almost always more productive - but let's be fair: there's a significant corpus of text where you're rude, dismissive, abrasive, and so on. I can provide links if you'd like, but I think you know very well that it's true. That should stop. Nevertheless, you expect everyone else to assume that you're working in good faith, have fairly considered and addressed their feedback, and so on. I suggest that you provide Johannes, in this case, the same courtesy you expect others to provide to you.

Sure, I was rude. When somebody blames me without any proof, I can be rude. I'm sorry for this! But it was caused by the unconstructive initial comment.

And, I'll point out that we have indeed accomplished a lot in terms of OpenMP feature enablement, etc. At this point, we have more eyes on the process and we should all work together to produce an even better result. The glass, in this case, really is half full.

hfinkel added a comment. Edited Jun 13 2019, 6:06 PM

...

Again, I see no reason to believe that everyone here isn't acting in good faith and working to create the software of the highest possible quality. Thanks!

Sorry, but Johannes did this. "The more patches go through with this kind of "review" the more "suspicious" this looks to me." Those are his words. It seems to me that he does not think that I or my colleagues are acting in good faith.

Okay, I understand why you say this, but no. Frankly, Johannes was implying that the reviews were too lax (e.g., allowing combined patches with insufficient tests/documentation/etc.).

Only one patch was combined.

I'm not sure that this is true. D55379 was also. There might be others. I don't think that it's useful to conduct an audit in this regard. The important point is that there were a number of issues over time, some have been addressed, and others we're addressing right now.

Ok, you found another one, committed half a year ago. Do you really think you can call that systematic and suspicious?

If there is a systematic problem, then this would be one part among several. Regardless, we both agree that this is something to be avoided, so as far as I'm concerned, this problem has been addressed.

Tests are added to all the functional patches, as soon as I discovered that the testing infrastructure for libomptarget was already configured. And I myself insist on adding test cases for functional patches; you can ask Doru about this.

I'm aware of a lot of the history (and I recall being involved in the discussion in D60578 regarding the tests). Nevertheless, we've somehow ended up with a library without sufficient documentation, and robust code reviews should have prevented that. It's not useful to assign blame, but it is something of which we should all be aware so that we can actively improve the situation going forward.

I said already that we need documentation. But I can't do everything myself. Plus, I'm not the original developer of the library; I'm just doing my best trying to improve it. Maybe you or Johannes can try to help with this and prepare documentation? That would be a great help!

Johannes and I can help with the documentation.

As for the documentation, yes, we lack documentation for libomptarget. This is a long story and it requires dedicated work. I'm also not very happy with the current situation. But the documentation is separate work. It requires a special set of patches with the design decisions, etc. Most of it follows the design of the original libomp, but has some NVPTX-specific parts.

Yes, we should have documentation covering the design decisions and other details, but I don't believe that treating this as a special task is productive. When a subsystem lacks good documentation, then we should improve that as we make changes to the code - just because the documentation is poor is not a good reason for not adding comments in the code now. It is, moreover, an even stronger argument for adding comments in the code now. Please do add whatever documentation is possible and associated with this change now.

We need a base from which to start. It is separate work to prepare a base document which could then be modified/upgraded with subsequent patches.

The lack of existing overall documentation does not prevent adding a comment explaining this particular variable, what it stores and why it is marked as volatile, now.

I think that all of these issues have now been covered here (and on D62199, etc.), and so I won't repeat them again. The reviewers (e.g., @grokos for the case of D62199) almost certainly believed that they were being helpful and moving the development process forward.

The fact that the patches did not all follow our best practices, etc. makes Johannes "suspicious" that the patches could have reasonably been improved before being committed.

Which ones do not follow "best practices"? The combined one? That is only one patch. The NFC patches? They are refactoring patches and we already have tests for them.

Again, it is not just one patch, and there have been a variety of issues. Some have already been addressed and others we're addressing here. The important point is to make sure that things improve over time, and that's exactly what we can do now.

He did not try to understand the idea of the patches; instead, he decided to treat them as "suspicious" without any comments.

This statement is unproductive, and likely false. Johannes has spent a lot of time looking at the past patches and the code. If he sees a systematic problem, then there probably is one.

Let me disagree with you.

It makes me suspicious of that too. But no one here believes that anyone was trying to subvert the system and produce an inferior result - it is very likely that everyone had, and continues to have, the best of intentions.

Maybe, just maybe, before starting to treat someone's activity as "suspicious", it would be better to try to understand it first? To read something, to ask questions in the proper manner, etc.?

Again, this is unproductive, and false. I agree that Johannes could have been more polite at times - frustration can be stated directly, and that's almost always more productive - but let's be fair: there's a significant corpus of text where you're rude, dismissive, abrasive, and so on. I can provide links if you'd like, but I think you know very well that it's true. That should stop. Nevertheless, you expect everyone else to assume that you're working in good faith, have fairly considered and addressed their feedback, and so on. I suggest that you provide Johannes, in this case, the same courtesy you expect others to provide to you.

Sure, I was rude. When somebody blames me without any proof, I can be rude. I'm sorry for this! But it was caused by the unconstructive initial comment.

At this point, I feel as though everyone here has voiced their concerns about past practices and behavior. We understand how to address each other's concerns going forward and maintain professional discourse. We should do so. I highly recommend that, at this point, we avoid future comments that a reader is likely to view as rude, condescending, dismissive, or confrontational. Any suboptimal code, documentation, etc. in previous patches can be improved by future patches, and that should be our focus going forward.

And, I'll point out that we have indeed accomplished a lot in terms of OpenMP feature enablement, etc. At this point, we have more eyes on the process and we should all work together to produce an even better result. The glass, in this case, really is half full.

...

Again, I see no reason to believe that everyone here isn't acting in good faith and working to create the software of the highest possible quality. Thanks!

Sorry, but Johannes did this. "The more patches go through with this kind of "review" the more "suspicious" this looks to me." Those are his words. It seems to me that he does not think that I or my colleagues are acting in good faith.

Okay, I understand why you say this, but no. Frankly, Johannes was implying that the reviews were too lax (e.g., allowing combined patches with insufficient tests/documentation/etc.).

Only one patch was combined.

I'm not sure that this is true. D55379 was also. There might be others. I don't think that it's useful to conduct an audit in this regard. The important point is that there were a number of issues over time, some have been addressed, and others we're addressing right now.

Ok, you found another one, committed half a year ago. Do you really think you can call that systematic and suspicious?

If there is a systematic problem, then this would be one part among several. Regardless, we both agree that this is something to be avoided, so as far as I'm concerned, this problem has been addressed.

Tests are added to all the functional patches, as soon as I discovered that the testing infrastructure for libomptarget was already configured. And I myself insist on adding test cases for functional patches; you can ask Doru about this.

I'm aware of a lot of the history (and I recall being involved in the discussion in D60578 regarding the tests). Nevertheless, we've somehow ended up with a library without sufficient documentation, and robust code reviews should have prevented that. It's not useful to assign blame, but it is something of which we should all be aware so that we can actively improve the situation going forward.

I said already that we need documentation. But I can't do everything myself. Plus, I'm not the original developer of the library; I'm just doing my best trying to improve it. Maybe you or Johannes can try to help with this and prepare documentation? That would be a great help!

Johannes and I can help with the documentation.

Good!

As for the documentation, yes, we lack documentation for libomptarget. This is a long story and it requires dedicated work. I'm also not very happy with the current situation. But the documentation is separate work. It requires a special set of patches with the design decisions, etc. Most of it follows the design of the original libomp, but has some NVPTX-specific parts.

Yes, we should have documentation covering the design decisions and other details, but I don't believe that treating this as a special task is productive. When a subsystem lacks good documentation, then we should improve that as we make changes to the code - just because the documentation is poor is not a good reason for not adding comments in the code now. It is, moreover, an even stronger argument for adding comments in the code now. Please do add whatever documentation is possible and associated with this change now.

We need a base from which to start. It is separate work to prepare a base document which could then be modified/upgraded with subsequent patches.

The lack of existing overall documentation does not prevent adding a comment explaining this particular variable, what it stores and why it is marked as volatile, now.

Sure, I'll add a comment for a variable describing its purpose and why it must be marked volatile.
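For illustration only, such a comment plus declaration might look like the sketch below. The identifier, element type, and size expression are assumptions for this sketch and do not necessarily match the actual libomptarget-nvptx source:

```cuda
// Sketch only; names and sizes here are illustrative.
// parallelLevel tracks the parallel-region nesting level, one counter per
// warp of the team. It is read and written concurrently by many threads,
// including from divergent branches mixed with atomic operations in
// full-runtime SPMD mode. The volatile qualifier tells the compiler that
// the value may be changed by another thread at any time, so every
// reference compiles to an actual shared-memory read or write rather than
// a value cached in a register; without it, stale reads can cause
// incorrect results or deadlocks (observed with CUDA 9 and later).
__shared__ volatile uint8_t parallelLevel[MAX_THREADS_PER_TEAM / WARPSIZE];
```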

I think that all of these issues have now been covered here (and on D62199, etc.), and so I won't repeat them again. The reviewers (e.g., @grokos for the case of D62199) almost certainly believed that they were being helpful and moving the development process forward.

The fact that the patches did not all follow our best practices, etc. makes Johannes "suspicious" that the patches could have reasonably been improved before being committed.

Which ones do not follow "best practices"? The combined one? That is only one patch. The NFC patches? They are refactoring patches and we already have tests for them.

Again, it is not just one patch, and there have been a variety of issues. Some have already been addressed and others we're addressing here. The important point is to make sure that things improve over time, and that's exactly what we can do now.

He did not try to understand the idea of the patches; instead, he decided to treat them as "suspicious" without any comments.

This statement is unproductive, and likely false. Johannes has spent a lot of time looking at the past patches and the code. If he sees a systematic problem, then there probably is one.

Let me disagree with you.

It makes me suspicious of that too. But no one here believes that anyone was trying to subvert the system and produce an inferior result - it is very likely that everyone had, and continues to have, the best of intentions.

Maybe, just maybe, before starting to treat someone's activity as "suspicious", it would be better to try to understand it first? To read something, to ask questions in the proper manner, etc.?

Again, this is unproductive, and false. I agree that Johannes could have been more polite at times - frustration can be stated directly, and that's almost always more productive - but let's be fair: there's a significant corpus of text where you're rude, dismissive, abrasive, and so on. I can provide links if you'd like, but I think you know very well that it's true. That should stop. Nevertheless, you expect everyone else to assume that you're working in good faith, have fairly considered and addressed their feedback, and so on. I suggest that you provide Johannes, in this case, the same courtesy you expect others to provide to you.

Sure, I was rude. When somebody blames me without any proof, I can be rude. I'm sorry for this! But it was caused by the unconstructive initial comment.

At this point, I feel as though everyone here has voiced their concerns about past practices and behavior. We understand how to address each other's concerns going forward and maintain professional discourse. We should do so. I highly recommend that, at this point, we avoid future comments that a reader is likely to view as rude, condescending, dismissive, or confrontational. Any suboptimal code, documentation, etc. in previous patches can be improved by future patches, and that should be our focus going forward.

And, I'll point out that we have indeed accomplished a lot in terms of OpenMP feature enablement, etc. At this point, we have more eyes on the process and we should all work together to produce an even better result. The glass, in this case, really is half full.

I want to investigate the racy accesses further and make sure it is not a miscompile inside LLVM.

This is not a problem inside LLVM. The problem appears after optimizations performed by the ptxas tool (when it compiles PTX to SASS) at O3 with the inlined runtime.

I extracted the test case (see below) but I was not seeing the ERROR. How did you run the test case to see a different value for Count?

You need to compile it with the inlined runtime at O2 or O3.

When I run
./bin/clang -fopenmp-targets=nvptx64-nvidia-cuda -O3 -fopenmp --cuda-path=/soft/compilers/cuda/cuda-9.1.85 -Xopenmp-target -march=sm_70 -fopenmp=libomp test.c -o test.ll -emit-llvm -S
I get

https://gist.github.com/jdoerfert/4376a251d98171326d625f2fb67b5259

which shows the inlined and optimized libomptarget.

And you need the latest version of libomptarget

My version is from today Jun 13 15:24:11 2019, git: 3bc6e2a7aa3853b06045c42e81af094647c48676

We have problems in Cuda 8, at least, for arch sm_35

I couldn't get that version to run properly so I asked someone who had a system set up.
Unfortunately, the test.c [1] did not trigger the problem. In test.c we run the new test part in spmd_parallel_regions.cpp 1000 times and check the result each time.
It was run with Cuda 8.0 for sm_35, sm_37, and sm_70.

Could you share more information on how the system has to look to trigger the problem?
Could you take a look at the test case we run and make sure it triggers the problem on your end?

[1] https://gist.github.com/jdoerfert/d2b18ca8bb5c3443cc1d26b23236866f

Will provide additional info on Tuesday. Most probably, this simplified test does not trigger the problem in your configuration. Will send the original, more complex test that triggers the problem.

You need to apply the patch D62318 to reproduce the problem for sure.

This means the problem, as of right now, does not exist, correct?
If so, what part of the D62318 patch is causing the problem?

Does the test.c that I floated earlier expose the problem then or do I need a different test case?
What configuration are you running? Is it reproducible with Cuda 9/10 and sm_70?

This means the problem, as of right now, does not exist, correct?

No, it still might appear, but it is much harder to run into trouble with the current version of the runtime.

If so, what part of the D62318 patch is causing the problem?

I significantly reduced the size of the runtime class, and that triggers some optimizations more often. The access to the parallelLevel variable, when we check the current parallel level to get the correct thread ID, triggers those optimizations.

Does the test.c that I floated earlier expose the problem then or do I need a different test case?
What configuration are you running? Is it reproducible with Cuda 9/10 and sm_70?

Yes, it exposes the problem, but only with D62318 applied. Not sure about Cuda9, will try to check this later today.

Checked with Cuda 9: it works. Most probably the problem is specific to Cuda 8; some optimizations in the ptxas tool were apparently fixed in Cuda 9. Clang defines the macro CUDA_VERSION and sets it to 8000 for Cuda 8, so I can check this macro and use volatile modifiers only for Cuda 8.
To reproduce the problem, you need to build the runtime without debug info and build the test at O3.

ABataev updated this revision to Diff 206671.Jun 26 2019, 7:54 AM

Use inline assembler to implement volatile load/stores for parallelLevel.

tra added inline comments.Jun 26 2019, 8:50 AM
libomptarget/deviceRTLs/nvptx/src/supporti.h
97–101

Perhaps it can be deduped into something like this:

#ifdef __LP64__
#define PTR_CONSTRAINT "l"
#else  // __LP64__
#define PTR_CONSTRAINT "r"
#endif // __LP64__

...
asm volatile("st.volatile.u8 [%1], %0;"
             :: "r"(Val), PTR_CONSTRAINT(&parLevel)
             : "memory");
ABataev marked an inline comment as done.Jun 26 2019, 8:54 AM
ABataev added inline comments.
libomptarget/deviceRTLs/nvptx/src/supporti.h
97–101

Sure, thanks.

ABataev updated this revision to Diff 206693.Jun 26 2019, 9:01 AM

Simplified constraint.

Thanks for the explicit asm. One of the hazards of compiling cuda for amdgcn is that volatile doesn't imply atomic, so this is a clear warning that I'll have a bug on merge.

libomptarget/deviceRTLs/nvptx/src/supporti.h
121

Could this function call getParallelLevel, increment/decrement the result, then call setParallelLevel?

ABataev updated this revision to Diff 219739.Sep 11 2019, 9:53 AM

Used getParallelLevel/setParallelLevel instead of explicit asm instructions in changeParallelLevel.

ABataev marked an inline comment as done.Sep 11 2019, 9:53 AM

It is probably not directly related to this patch, but I think the logic of omp_in_parallel() is not valid. It only works because the level counting is also broken. This code

int main(){
#pragma omp target 
  #pragma omp parallel if(0)
    if(omp_get_thread_num()==0) {
      printf("omp_in_parallel=%i\n", omp_in_parallel()); 
      printf("omp_get_level=%i\n", omp_get_level()); 
      printf("omp_get_active_level=%i\n", omp_get_active_level()); 
    }
  return 0;
}

should print (and does so with env OMP_TARGET_OFFLOAD=disabled):

omp_in_parallel=0
omp_get_level=1
omp_get_active_level=0

As I understand the code for omp_in_parallel(), the example code would print omp_in_parallel=1 if the level were correctly increased.

You're wrong. My output:

omp_in_parallel=0
omp_get_level=1
omp_get_active_level=0

So, omp_in_parallel works correctly on GPU.

If it does not work correctly with OMP_TARGET_OFFLOAD=disabled, the problem is definitely in libomp, not libomptarget. With offloading disabled, the code is not offloaded to the GPU.

You're wrong. My output:

omp_in_parallel=0
omp_get_level=1
omp_get_active_level=0

So, omp_in_parallel works correctly on GPU.

Ok, thanks for testing. I probably didn't apply the patch correctly, then.

If it does not work correctly with OMP_TARGET_OFFLOAD=disabled, the problem is definitely in libomp, not libomptarget. With disabled offloading the code is not offloaded to GPU.

I got the above output only with host offloading, so that part also seems to be fine.

After revisiting the code I understood that the value returned by getParallelLevel is basically a bitfield: struct { int active_parallel_level : 1; int parallel_level : 7; }.

Yes, you got it!

There was a lot of discussion here, largely about volatile vs atomic. The final patch looks uncontentious to me - accesses to parallelLevel are now wrapped in calls that implement atomic access in the appropriate fashion for nvptx. I agree that the test is at coarse granularity, but as it's essentially poking at the race condition I can't see a better way to do it.

I'm confident in this patch as it stands. @jdoerfert?

@grokos , Georgios, could you approve the patch if there are no more comments?

So this changed over time, we for sure need to update the name and maybe commit message.

As @JonChesterfield pointed out, the wrapping of the accesses in function calls is fine.
The two functions with inline assembly should live in the target_impl.h though.

At some point it was said this fixes a problem that only occurs with D62318 applied.
Is this still the case? If not, we need to specify exactly how we can reproduce this issue, e.g., in a comment next to the target offload region.

So this changed over time, we for sure need to update the name and maybe commit message.

As @JonChesterfield pointed out, the wrapping of the accesses in function calls is fine.
The two functions with inline assembly should live in the target_impl.h though.

At some point it was said this fixes a problem that only occurs with D62318 applied.
Is this still the case? If not, we need to specify exactly how we can reproduce this issue, e.g., in a comment next to the target offload region.

Yes, it is still the case. The Cuda 8 compiler is too optimistic and makes some incorrect optimizations with D62318. This patch prevents those optimizations. Cuda 9 and later have no such problems.

ABataev updated this revision to Diff 223602.Mon, Oct 7, 7:38 AM

Moved target-specific code to target_impl.h

we for sure need to update the name and maybe commit message.

Assuming the name & commit message are adjusted I'm withdrawing my objections.

I was hoping to look into this issue because I'm *very* unhappy to claim "The cuda8 compiler is too optimistic and makes some incorrect optimizations" without actually understanding what the problem is. Unfortunately, we still do not have a proper CI infrastructure to test things and I seem to not find the time to reproduce the issue on my end. (anyone else interested in creating a reproducer for this? @Hahnfeld maybe?)

I'm *very* unhappy to claim "The cuda8 compiler is too optimistic and makes some incorrect optimizations" without actually understanding what the problem is

That seems fair. At the least, the comment could be more specific.

I thought the salient part of the patch was the "memory" clobber, working around a compiler bug that reordered memory accesses incorrectly. On a closer look though, the comment suggests the fix is actually the .volatile qualifier on the instruction influencing ptxas. If so, a volatile qualifier on the byte array would presumably also be a fix. Why is the underlying byte array not volatile qualified?

I'm *very* unhappy to claim "The cuda8 compiler is too optimistic and makes some incorrect optimizations" without actually understanding what the problem is

That seems fair. At the least, the comment could be more specific.

I thought the salient part of the patch was the "memory" clobber, working around a compiler bug that reordered memory accesses incorrectly. On a closer look though, the comment suggests the fix is actually the .volatile qualifier on the instruction influencing ptxas. If so, a volatile qualifier on the byte array would presumably also be a fix. Why is the underlying byte array not volatile qualified?

I will fix the description and the title. The fix is strictly for Cuda 8; Cuda 9 and later do not have this problem.
As to the volatile modifier, read the whole thread. There was a big discussion about it.

As to the volatile modifier, read the whole thread. There was a big discussion about it.

I have read the thread. There's a long discussion about volatile vs atomic. There is no justification for marking the accesses volatile via asm instead of marking the byte array as volatile.

The semantics of ptx volatile are stronger than those of llvm. Weird, but not a justification for omitting the volatile qualifier from the C in this change.

Assuming llvm doesn't spuriously drop volatile (which would be a severe bug), and that the lowering to ptx doesn't drop volatile (which also seems likely to be fatal), a volatile array in C would lead to the same memory operations introduced by the asm. That's why I thought it was the memory clobber that mattered.

As to the volatile modifier, read the whole thread. There was a big discussion about it.

I have read the thread. There's a long discussion about volatile vs atomic. There is no justification for marking the accesses volatile via asm instead of marking the byte array as volatile.

It means that you missed something. Initially, the patch marked the whole array as volatile, but after some discussion it was decided to use PTX instructions directly to avoid some side effects.

The semantics of ptx volatile are stronger than those of llvm. Weird, but not a justification for omitting the volatile qualifier from the C in this change.

Assuming llvm doesn't spuriously drop volatile (which would be a severe bug), and that the lowering to ptx doesn't drop volatile (which also seems likely to be fatal), a volatile array in C would lead to the same memory operations introduced by the asm. That's why I thought it was the memory clobber that mattered.

It means that you missed something. Initially, the patch marked the whole array as volatile, but after some discussion it was decided to use PTX instructions directly to avoid some side effects.

I don't think so. Phabricator doesn't make the discussion easy to follow relative to the changes in code, but I'm pretty sure the sequence was:

  • Mark the array volatile in the initial patch
  • Long discussion about volatile vs atomic
  • Agreement that the control flow requires atomic semantics
  • Discussion about how to express that in cuda
  • Suggestion to "declare the variable volatile, and then use inline asm ld.volatile/st.volatile to increment it"
  • Drop the volatile qualification on the array and use inline asm to increment it
  • Write a comment saying a volatile access is used to avoid dangerous optimizations
  • We start this exchange

The byte array should be volatile qualified, as it was in your initial patch. And you should use inline asm or equivalent to prevent memory reordering for ptx. And the comment should probably reflect that the code requires both ptx volatile and the constraint on memory order.

Not again! We don't need to make it atomic; we have memory and thread barriers. But the Cuda 8 compiler sometimes just misses them and moves accesses to the variables across the barriers though it should not. The simplest and least intrusive way to prevent this is to mark those memory accesses as volatile (or atomic, which is too expensive). And this is what we do in this patch.

JonChesterfield added a comment.EditedMon, Oct 7, 6:42 PM

The byte array should be volatile qualified, as it was in your initial patch. And you should use inline asm or equivalent to prevent memory reordering for ptx. And the comment should probably reflect that the code requires both ptx volatile and the constraint on memory order.

Not again! We don't need to make it atomic,

I didn't say it should be atomic. I said the data should be volatile and the comment updated.

This patch prevents llvm from moving memory accesses across other memory operations by using a memory clobber in the inline assembly. That the asm also contains the word volatile is distracting but irrelevant to the transforms in llvm.

The PTX compiler presumably looks at the memory clobber or parses the inline asm - either would suffice there.

we have memory and thread barriers. But the Cuda 8 compiler sometimes just misses them and moves accesses to the variables across the barriers though it should not.

This discussion is complicated by two notions of volatile (llvm, ptx) and four compilers (clang, llvm, ptx, nvcc). Consequently it's ambiguous which barriers you're referring to but I don't think that matters here. I suspect 'should not' is contentious as it requires llvm to treat volatile differently when compiling for nvptx.

The simplest and least intrusive way to prevent this is to mark those memory accesses as volatile (or atomic, which is too expensive). And this is what we do in this patch.

This patch doesn't mark memory accesses as volatile in llvm. It might do in ptx, if that compiler parses inline asm. This patch does introduce a memory clobber which prevents reordering operations.

Harbormaster completed remote builds in B39089: Diff 223609.

The byte array should be volatile qualified, as it was in your initial patch. And you should use inline asm or equivalent to prevent memory reordering for ptx. And the comment should probably reflect that the code requires both ptx volatile and the constraint on memory order.

Not again! We don't need to make it atomic,

I didn't say it should be atomic. I said the data should be volatile and the comment updated.

This patch prevents llvm from moving memory accesses across other memory operations by using a memory clobber in the inline assembly. That the asm also contains the word volatile is distracting but irrelevant to the transforms in llvm.

No, not in LLVM. LLVM behaves correctly; it is ptxas from CUDA 8 that does some wrong optimizations.

The PTX compiler presumably looks at the memory clobber or parses the inline asm - either would suffice there.

we have memory and thread barriers. But the Cuda 8 compiler sometimes just misses them and moves accesses to the variables across the barriers though it should not.

This discussion is complicated by two notions of volatile (llvm, ptx) and four compilers (clang, llvm, ptx, nvcc). Consequently it's ambiguous which barriers you're referring to but I don't think that matters here. I suspect 'should not' is contentious as it requires llvm to treat volatile differently when compiling for nvptx.

The simplest and least intrusive way to prevent this is to mark those memory accesses as volatile (or atomic, which is too expensive). And this is what we do in this patch.

This patch doesn't mark memory accesses as volatile in llvm. It might do in ptx, if that compiler parses inline asm. This patch does introduce a memory clobber which prevents reordering operations.

We don't need to fix anything in LLVM, we just need to prevent some optimizations in ptxas. PTX inline code does exactly what we need here.

jlebar added a comment.Tue, Oct 8, 6:58 AM

Hi, jumping in here. I haven't read the whole bug, apologies.

ptxas from CUDA 8 has *many* known bugs. I would also strongly oppose attempting to work around them in LLVM.

If you cannot upgrade from CUDA 8, you can still take a newer ptxas binary and use it in combination with the rest of CUDA 8. We have done this at Google for years now with no problems, and with a blessing from nvidia.

If you encounter bugs in the latest ptxas and can provide a reproducer, we can file bugs against nvidia if you cannot. (Well, I guess I'm volunteering tra, I don't work on this anymore. :) I'm not opposed to checking in workarounds for bugs in *the latest* ptxas if we have a process to remove these workarounds soon after a newer ptxas is available (i.e. we don't say, "remove after three ptxas releases" or something). I would strongly oppose keeping workarounds for old ptxas versions because that would greatly complicate the NVPTX backend and have little benefit.

Then we're finally on the same page. Thanks for your patience.

No, I'm not trying to add a workaround for a bug in LLVM, since there are no problems in LLVM's optimizations. The same code works correctly with Cuda 9.2. But I would like to add the workaround for Cuda 8, since we need to fully support it unless we drop Cuda 8 support. It means that, unfortunately, we need to support Cuda 8 in full, with all its bugs and problems.
I'm not happy to add it, but we just need to do this. Later, when we drop support for Cuda 8, we can remove this workaround and just use regular code.

ABataev updated this revision to Diff 223845.Tue, Oct 8, 7:10 AM

Fixed title and description, rebase.

jlebar requested changes to this revision.Tue, Oct 8, 7:35 AM

No, I'm not trying to add a workaround for a bug in LLVM, since there are no problems in LLVM's optimizations. The same code works correctly with Cuda 9.2. But I would like to add the workaround for Cuda 8, since we need to fully support it unless we drop Cuda 8 support.

Based on this and my comment above, I strongly oppose this patch.

ptxas from CUDA 8 has many, many known bugs. I oppose working around them in LLVM; it will eat up a huge amount of maintainers' time, and you will never cover all or even many of the bugs.

If you want to un-support CUDA 8 in LLVM, fine by me. If in your own personal setup you want to use the ptxas from CUDA 10.1 with the rest of the CUDA 8 toolkit, that should also work.

This revision now requires changes to proceed.Tue, Oct 8, 7:35 AM

Fine, then no more optimizations in the runtime unless we drop support for CUDA 8.

ABataev abandoned this revision.Tue, Oct 8, 7:50 AM

Based on this and my comment above, I strongly oppose this patch.

ptxas from CUDA 8 has many, many known bugs. I oppose working around them in LLVM; it will eat up a huge amount of maintainers' time, and you will never cover all or even many of the bugs.

Ideologically I'm right there with you. Killing off CUDA 8 support works for me - there's presumably an official deprecation process we should follow. Announcements and so forth.

Pragmatically, you're overlooking a couple of things. We don't need to cover many of the bugs to keep the deviceRTL working - it's 6k lines of relatively straightforward CUDA that will exercise a small subset of the NVPTX toolchain.

More fundamentally, it may be necessary to keep CUDA 8 working despite the broken toolchain. People write code against buggy compilers, and when the compiler is changed, their code behaves differently. The use case is wanting to use a recent LLVM, e.g. for x86-related security patches, while also continuing to use a buggy and outdated CUDA toolchain so that their numerical codes keep getting the same answers.

So if we can drop CUDA 8, great. If some of the maintainers have a customer that requires CUDA 8 to continue working with a recent LLVM - which is sad but plausible - we're better off supporting that use case in tree than forcing an out-of-tree fork.

jlebar added a comment.Tue, Oct 8, 8:51 AM

So if we can drop CUDA 8, great. If some of the maintainers have a customer that requires CUDA 8 to continue working with a recent LLVM - which is sad but plausible - we're better off supporting that use case in tree than forcing an out-of-tree fork.

This one revision, which attempts to work around one bug out of many in CUDA 8's ptxas, is months old and has about a hundred comments from many maintainers. The cost of going down this road is huge and imposed on all backend maintainers. You can see that here in this review, where tra got pulled in repeatedly, despite not having any direct interest in OpenMP.

As a primary maintainer of the NVPTX backend I would indeed prefer that someone fork LLVM than ask me for assistance supporting this old, buggy ptxas.

I have been down this road, it was my life for three years. Ignore my cries at your own peril, etc etc.

We don't need to cover many of the bugs to keep the deviceRTL working - it's 6k lines of relatively straightforward CUDA that will exercise a small subset of the NVPTX toolchain.

My objection isn't about deviceRTL specifically.

If those working on deviceRTL/openmp promise never to involve NVPTX backend maintainers in any discussions involving changes to deviceRTL/openmp source code to work around bugs in old ptxas, and promise never to suggest changes to clang or LLVM proper to work around such bugs, then you have my blessing to do whatever you'd like in your source code, just like any non-LLVM project which uses clang/llvm can make whatever changes to their code to work around whatever bugs.

In practice this would probably mean we'd need to set up email filters to send anything containing "openmp" to /dev/null. But I think that's not the kind of community we want to build? I want to partner with you all, and I think you all benefit from partnering with us. Thus the need for whole-community standards as far as what we do and don't support.

jlebar added a comment.EditedTue, Oct 8, 11:57 AM

I spoke offline with tra@ about this; he was confused why I was coming down so hard here.

For lack of a better word, I think I have some PTSD from customers being hit by ptxas bugs and then expecting me to fix or work around them. I know that even very simple CUDA code can hit bugs in old ptxas. There are even data-dependent bugs which only manifest with certain pointer values. This makes it difficult to have confidence that any particular function works properly, even if you've tested it thoroughly. However awful you think it is, it's worse.

We've been working on this for a long time and have found that the only thing which consistently works is using the newest ptxas. Upgrading is painful, but using a buggy optimizing assembler is worse.

I think the root reason I came down hard here is because I was/am afraid that this change would be used as justification for trying to work around even a subset of these bugs in LLVM itself. That would be a Sisyphean task and so I don't want us to sign up for it.

If you all are willing to say that this patch won't be used as precedent to argue in favor of making changes to clang or LLVM for compatibility with ptxas 8.0, then I withdraw my objections here.