Page MenuHomePhabricator

[OPENMP][NVPTX]Make the test compatible with CUDA9+, NFC.
AbandonedPublic

Authored by ABataev on Mon, Jul 22, 1:50 PM.

Details

Reviewers
grokos
jdoerfert
Summary

Due to architectural changes, the minimal chunk of threads that can
executed separately without undefined bahavior on Cuda9+ is a warp, i.e.
the whole warp must be executed in convergence. Otherwise, the runtime
produces either incorrect results, or (in most cases) just hangs because
of the synchronous nature of new shuffle, ballot, warpsync etc.
functions.
Reworked the test to match these limitations to make it work on cuda9+
(critical regions require warp size minimal blocks). Also, the code of
the critical sections must be fixed to make it finally work in Cuda9+
(currently it still does not work correctly, some extra changes to the runtime and compiler are required).

Event Timeline

ABataev created this revision.Mon, Jul 22, 1:50 PM
Herald added a project: Restricted Project. · View Herald TranscriptMon, Jul 22, 1:50 PM

So if a user does this, the application will also hang?

So if a user does this, the application will also hang?

Yes, unfortunately. They changed the way they handle threads in warps. Before cuda9 the threads in the warps were implicitly convergent, in cuda 9+ it is not so.
Cuda introduced a whole bunch of new, _sync versions of primitives and a syncwarp function, that require an additional parameter - a mask of active threads in the warp. And all threads with this mask shall execute those primitives, otherwise the result is undefined. It is recommended to define these mask at the start of your code block because, since the threads are not explicitly convergent anymore, you cannot rely on the results of ballot(1) or __activemask() functions.
We have to hardcoded this mask to 0xffffffff because of this, i.e. the minimal granularity in cuda9+ is the whole warp (for the functionality that require to synchronize threads in the warp, like critical sections, reductions, etc.). Otherwise the behavior is undefined.

jdoerfert requested changes to this revision.Tue, Jul 23, 3:50 AM

So if a user does this, the application will also hang?

Yes, unfortunately. They changed the way they handle threads in warps. Before cuda9 the threads in the warps were implicitly convergent, in cuda 9+ it is not so.
Cuda introduced a whole bunch of new, _sync versions of primitives and a syncwarp function, that require an additional parameter - a mask of active threads in the warp. And all threads with this mask shall execute those primitives, otherwise the result is undefined. It is recommended to define these mask at the start of your code block because, since the threads are not explicitly convergent anymore, you cannot rely on the results of ballot(1) or __activemask() functions.
We have to hardcoded this mask to 0xffffffff because of this, i.e. the minimal granularity in cuda9+ is the whole warp (for the functionality that require to synchronize threads in the warp, like critical sections, reductions, etc.). Otherwise the behavior is undefined.

I have the feeling this just hides bugs in the runtime. I am also confused why this shows up now. What changed?

This revision now requires changes to proceed.Tue, Jul 23, 3:50 AM

So if a user does this, the application will also hang?

Yes, unfortunately. They changed the way they handle threads in warps. Before cuda9 the threads in the warps were implicitly convergent, in cuda 9+ it is not so.
Cuda introduced a whole bunch of new, _sync versions of primitives and a syncwarp function, that require an additional parameter - a mask of active threads in the warp. And all threads with this mask shall execute those primitives, otherwise the result is undefined. It is recommended to define these mask at the start of your code block because, since the threads are not explicitly convergent anymore, you cannot rely on the results of ballot(1) or __activemask() functions.
We have to hardcoded this mask to 0xffffffff because of this, i.e. the minimal granularity in cuda9+ is the whole warp (for the functionality that require to synchronize threads in the warp, like critical sections, reductions, etc.). Otherwise the behavior is undefined.

I have the feeling this just hides bugs in the runtime. I am also confused why this shows up now. What changed?

Read the docs, please, before saying anything like that.
Nothing has changed, it did not work in Cuda9+ since the very beginning.

So if a user does this, the application will also hang?

Yes, unfortunately. They changed the way they handle threads in warps. Before cuda9 the threads in the warps were implicitly convergent, in cuda 9+ it is not so.
Cuda introduced a whole bunch of new, _sync versions of primitives and a syncwarp function, that require an additional parameter - a mask of active threads in the warp. And all threads with this mask shall execute those primitives, otherwise the result is undefined. It is recommended to define these mask at the start of your code block because, since the threads are not explicitly convergent anymore, you cannot rely on the results of ballot(1) or __activemask() functions.
We have to hardcoded this mask to 0xffffffff because of this, i.e. the minimal granularity in cuda9+ is the whole warp (for the functionality that require to synchronize threads in the warp, like critical sections, reductions, etc.). Otherwise the behavior is undefined.

I have the feeling this just hides bugs in the runtime. I am also confused why this shows up now. What changed?

Read the docs, please, before saying anything like that.
Nothing has changed, it did not work in Cuda9+ since the very beginning.

That does not contradict what I said. The test seems to be valid from the OpenMP perspective, if you do not think so, say why.

Please stop replying simply with an URL and no explanation. If you think there is something in there that is important to the discussion, say what and where.

Also, please stop condescending comments as this one altogether.

So if a user does this, the application will also hang?

Yes, unfortunately. They changed the way they handle threads in warps. Before cuda9 the threads in the warps were implicitly convergent, in cuda 9+ it is not so.
Cuda introduced a whole bunch of new, _sync versions of primitives and a syncwarp function, that require an additional parameter - a mask of active threads in the warp. And all threads with this mask shall execute those primitives, otherwise the result is undefined. It is recommended to define these mask at the start of your code block because, since the threads are not explicitly convergent anymore, you cannot rely on the results of ballot(1) or __activemask() functions.
We have to hardcoded this mask to 0xffffffff because of this, i.e. the minimal granularity in cuda9+ is the whole warp (for the functionality that require to synchronize threads in the warp, like critical sections, reductions, etc.). Otherwise the behavior is undefined.

I have the feeling this just hides bugs in the runtime. I am also confused why this shows up now. What changed?

Read the docs, please, before saying anything like that.
Nothing has changed, it did not work in Cuda9+ since the very beginning.

That does not contradict what I said. The test seems to be valid from the OpenMP perspective, if you do not think so, say why.

It is valid. But we just can't support this kind of programs with cuda9+ because of the internal limitations. OpenMP is just not fully compatible with cuda 9+.

I can reproduce that this test hangs on our Volta GPUs and I debugged it briefly: The problem seems to be how Clang emits critical regions, more precisely that the generated code introduces a barrier. Effectively, this assumes that all threads pass via all critical regions the same number of times, making it a worksharing construct. Obviously, this is not true for the current code, only 10 threads are needed to execute the J loop and all other threads wait at the end of the kernel. If I manually remove the barrier(s) from the generated code, the executable finishes and prints the correct results.

(I have the following theory why the updated test case works, even though still not all threads execute the critical regions (only 64 threads for the first one, only 32 for the second one): Because these are now at warp boundary, the other warps can already finish the kernel after they went through their barriers. Afterwards, I think they don't need to participate in excess barriers and the kernel terminates. I guess this does not work if the number of barriers doesn't match for threads inside the same warp.)

Can we implement critical regions without barriers? If yes, I think we should do this (even if it's slower), because the current scheme may lead to deadlocks for conformant OpenMP programs.

I can reproduce that this test hangs on our Volta GPUs and I debugged it briefly: The problem seems to be how Clang emits critical regions, more precisely that the generated code introduces a barrier. Effectively, this assumes that all threads pass via all critical regions the same number of times, making it a worksharing construct. Obviously, this is not true for the current code, only 10 threads are needed to execute the J loop and all other threads wait at the end of the kernel. If I manually remove the barrier(s) from the generated code, the executable finishes and prints the correct results.

(I have the following theory why the updated test case works, even though still not all threads execute the critical regions (only 64 threads for the first one, only 32 for the second one): Because these are now at warp boundary, the other warps can already finish the kernel after they went through their barriers. Afterwards, I think they don't need to participate in excess barriers and the kernel terminates. I guess this does not work if the number of barriers doesn't match for threads inside the same warp.)

Can we implement critical regions without barriers? If yes, I think we should do this (even if it's slower), because the current scheme may lead to deadlocks for conformant OpenMP programs.

Thank you for this in-depth analysis. This makes way more sense now. I agree, we cannot deadlock/miscompile/forbid a conformant OpenMP program but instead need to implement the semantics correctly.

Btw., we checked XL and it does work as expected on this test case.

I can reproduce that this test hangs on our Volta GPUs and I debugged it briefly: The problem seems to be how Clang emits critical regions, more precisely that the generated code introduces a barrier. Effectively, this assumes that all threads pass via all critical regions the same number of times, making it a worksharing construct. Obviously, this is not true for the current code, only 10 threads are needed to execute the J loop and all other threads wait at the end of the kernel. If I manually remove the barrier(s) from the generated code, the executable finishes and prints the correct results.

Yep, this is what I'm going to fix in my next patches.
But simple removal does not help, actually. It still might produce incorrect results. When you removed the barrier, you introduced implicit threads divergence. Since cuda9+ threads are not executed in lock-step manner anymore (see https://devblogs.nvidia.com/using-cuda-warp-level-primitives/). It leads to the fact that the result you get is not stable and not guaranteed to be reproduced on other platforms by other users.
The runtime relies on the warp-synchronous model and threads in the warp after the critical region must be synchronized. It means, that we should not use barriers here but still need to synchronize threads within the warp. To synchronize the threads we must use __syncwarp(mask) function instead.
Currently, it is pure coincidence that the test passes. It happens just because the second parallel region requires much time to start and execute its body and the threads in the else branch have time to execute their code. But it is not guaranteed in Cuda9+.
To reveal this problem, just enclose the code in else branch (Count += omp_get_level(); // 6 * 1 = 6) under control of another #pragma omp critical.

#pragma omp critical
Count += omp_get_level(); // 6 * 1 = 6

It must produce the same result as before but it won't, most probably.

(I have the following theory why the updated test case works, even though still not all threads execute the critical regions (only 64 threads for the first one, only 32 for the second one): Because these are now at warp boundary, the other warps can already finish the kernel after they went through their barriers. Afterwards, I think they don't need to participate in excess barriers and kernel terminates. I guess this does not work if the number of barriers doesn't match for threads inside the same warp.)

Even now it does not work, it returns an incorrect result but does not hang at least.

Can we implement critical regions without barriers? If yes, I think we should do this (even if it's slower), because the current scheme may lead to deadlocks for conformant OpenMP programs.

I'm not saying that we must keep the barrier, no. The barrier must be removed for sure. Bu the threads in the warp still must be synchronized. Otherwise, the result is not guaranteed. But in this case, we can operate only by the full warp, unfortunately.

Another possible solution could be just a spinlock in atomic operations to implement a lockless critical barrier. All the threads in the critical region must do this unconditionally. This is going to be much-much slower + I'm not sure that it will work. But if everybody is fine with the slower solution I can try to implement it.

I can reproduce that this test hangs on our Volta GPUs and I debugged it briefly: The problem seems to be how Clang emits critical regions, more precisely that the generated code introduces a barrier. Effectively, this assumes that all threads pass via all critical regions the same number of times, making it a worksharing construct. Obviously, this is not true for the current code, only 10 threads are needed to execute the J loop and all other threads wait at the end of the kernel. If I manually remove the barrier(s) from the generated code, the executable finishes and prints the correct results.

Yep, this is what I'm going to fix in my next patches.

I think we should fix this first instead of relaxing a test that fails for something that is easy to fix.

But simple removal does not help, actually. It still might produce incorrect results. When you removed the barrier, you introduced implicit threads divergence. Since cuda9+ threads are not executed in lock-step manner anymore (see https://devblogs.nvidia.com/using-cuda-warp-level-primitives/). It leads to the fact that the result you get is not stable and not guaranteed to be reproduced on other platforms by other users.
The runtime relies on the warp-synchronous model and threads in the warp after the critical region must be synchronized. It means, that we should not use barriers here but still need to synchronize threads within the warp. To synchronize the threads we must use __syncwarp(mask) function instead.
Currently, it is pure coincidence that the test passes. It happens just because the second parallel region requires much time to start and execute its body and the threads in the else branch have time to execute their code. But it is not guaranteed in Cuda9+.
To reveal this problem, just enclose the code in else branch (Count += omp_get_level(); // 6 * 1 = 6) under control of another #pragma omp critical.

#pragma omp critical
Count += omp_get_level(); // 6 * 1 = 6

It must produce the same result as before but it won't, most probably.

I still get the correct results. Do you have a test that you know to fail?

(I have the following theory why the updated test case works, even though still not all threads execute the critical regions (only 64 threads for the first one, only 32 for the second one): Because these are now at warp boundary, the other warps can already finish the kernel after they went through their barriers. Afterwards, I think they don't need to participate in excess barriers and kernel terminates. I guess this does not work if the number of barriers doesn't match for threads inside the same warp.)

Even now it does not work, it returns an incorrect result but does not hang at least.

Which one, my "hacked" version without the barrier or the changed test as proposed in this patch?

Can we implement critical regions without barriers? If yes, I think we should do this (even if it's slower), because the current scheme may lead to deadlocks for conformant OpenMP programs.

I'm not saying that we must keep the barrier, no. The barrier must be removed for sure. Bu the threads in the warp still must be synchronized. Otherwise, the result is not guaranteed. But in this case, we can operate only by the full warp, unfortunately.

Another possible solution could be just a spinlock in atomic operations to implement a lockless critical barrier. All the threads in the critical region must do this unconditionally. This is going to be much-much slower + I'm not sure that it will work. But if everybody is fine with the slower solution I can try to implement it.

I don't understand the full problem yet, so I can't comment on your proposal.

I can reproduce that this test hangs on our Volta GPUs and I debugged it briefly: The problem seems to be how Clang emits critical regions, more precisely that the generated code introduces a barrier. Effectively, this assumes that all threads pass via all critical regions the same number of times, making it a worksharing construct. Obviously, this is not true for the current code, only 10 threads are needed to execute the J loop and all other threads wait at the end of the kernel. If I manually remove the barrier(s) from the generated code, the executable finishes and prints the correct results.

Yep, this is what I'm going to fix in my next patches.

I think we should fix this first instead of relaxing a test that fails for something that is easy to fix.

But simple removal does not help, actually. It still might produce incorrect results. When you removed the barrier, you introduced implicit threads divergence. Since cuda9+ threads are not executed in lock-step manner anymore (see https://devblogs.nvidia.com/using-cuda-warp-level-primitives/). It leads to the fact that the result you get is not stable and not guaranteed to be reproduced on other platforms by other users.
The runtime relies on the warp-synchronous model and threads in the warp after the critical region must be synchronized. It means, that we should not use barriers here but still need to synchronize threads within the warp. To synchronize the threads we must use __syncwarp(mask) function instead.
Currently, it is pure coincidence that the test passes. It happens just because the second parallel region requires much time to start and execute its body and the threads in the else branch have time to execute their code. But it is not guaranteed in Cuda9+.
To reveal this problem, just enclose the code in else branch (Count += omp_get_level(); // 6 * 1 = 6) under control of another #pragma omp critical.

#pragma omp critical
Count += omp_get_level(); // 6 * 1 = 6

It must produce the same result as before but it won't, most probably.

I still get the correct results. Do you have a test that you know to fail?

I get Expected count = 67 with the critical section in this test. It is on Power9 with Cuda9. Did you try to compile it at O3?

(I have the following theory why the updated test case works, even though still not all threads execute the critical regions (only 64 threads for the first one, only 32 for the second one): Because these are now at warp boundary, the other warps can already finish the kernel after they went through their barriers. Afterwards, I think they don't need to participate in excess barriers and kernel terminates. I guess this does not work if the number of barriers doesn't match for threads inside the same warp.)

Even now it does not work, it returns an incorrect result but does not hang at least.

Which one, my "hacked" version without the barrier or the changed test as proposed in this patch?

The one from this patch. It does not hang anymore but returns an incorrect result. Anyway, the compiler still must be fixed.

Can we implement critical regions without barriers? If yes, I think we should do this (even if it's slower), because the current scheme may lead to deadlocks for conformant OpenMP programs.

I'm not saying that we must keep the barrier, no. The barrier must be removed for sure. Bu the threads in the warp still must be synchronized. Otherwise, the result is not guaranteed. But in this case, we can operate only by the full warp, unfortunately.

Another possible solution could be just a spinlock in atomic operations to implement a lockless critical barrier. All the threads in the critical region must do this unconditionally. This is going to be much-much slower + I'm not sure that it will work. But if everybody is fine with the slower solution I can try to implement it.

I don't understand the full problem yet, so I can't comment on your proposal.

Ignore my last sentence:

Another possible solution could be just a spinlock in atomic operations to implement a lockless critical barrier. All the threads in the critical region must do this unconditionally. This is going to be much-much slower + I'm not sure that it will work. But if everybody is fine with the slower solution I can try to implement it.

Forgot to remove it, it won't work. When you removed the barrier, you get this solution already. But it is not stable.

I can reproduce that this test hangs on our Volta GPUs and I debugged it briefly: The problem seems to be how Clang emits critical regions, more precisely that the generated code introduces a barrier. Effectively, this assumes that all threads pass via all critical regions the same number of times, making it a worksharing construct. Obviously, this is not true for the current code, only 10 threads are needed to execute the J loop and all other threads wait at the end of the kernel. If I manually remove the barrier(s) from the generated code, the executable finishes and prints the correct results.

Yep, this is what I'm going to fix in my next patches.

I think we should fix this first instead of relaxing a test that fails for something that is easy to fix.

But simple removal does not help, actually. It still might produce incorrect results. When you removed the barrier, you introduced implicit threads divergence. Since cuda9+ threads are not executed in lock-step manner anymore (see https://devblogs.nvidia.com/using-cuda-warp-level-primitives/). It leads to the fact that the result you get is not stable and not guaranteed to be reproduced on other platforms by other users.
The runtime relies on the warp-synchronous model and threads in the warp after the critical region must be synchronized. It means, that we should not use barriers here but still need to synchronize threads within the warp. To synchronize the threads we must use __syncwarp(mask) function instead.
Currently, it is pure coincidence that the test passes. It happens just because the second parallel region requires much time to start and execute its body and the threads in the else branch have time to execute their code. But it is not guaranteed in Cuda9+.
To reveal this problem, just enclose the code in else branch (Count += omp_get_level(); // 6 * 1 = 6) under control of another #pragma omp critical.

#pragma omp critical
Count += omp_get_level(); // 6 * 1 = 6

It must produce the same result as before but it won't, most probably.

I still get the correct results. Do you have a test that you know to fail?

I get Expected count = 67 with the critical section in this test. It is on Power9 with Cuda9. Did you try to compile it at O3?

At first I didn't, but now the original test case with added critical in the else branch works with full optimization iff I completely remove the specialization CGOpenMPRuntimeNVPTX::emitCriticalRegion.

I can reproduce that this test hangs on our Volta GPUs and I debugged it briefly: The problem seems to be how Clang emits critical regions, more precisely that the generated code introduces a barrier. Effectively, this assumes that all threads pass via all critical regions the same number of times, making it a worksharing construct. Obviously, this is not true for the current code, only 10 threads are needed to execute the J loop and all other threads wait at the end of the kernel. If I manually remove the barrier(s) from the generated code, the executable finishes and prints the correct results.

Yep, this is what I'm going to fix in my next patches.

I think we should fix this first instead of relaxing a test that fails for something that is easy to fix.

But simple removal does not help, actually. It still might produce incorrect results. When you removed the barrier, you introduced implicit threads divergence. Since cuda9+ threads are not executed in lock-step manner anymore (see https://devblogs.nvidia.com/using-cuda-warp-level-primitives/). It leads to the fact that the result you get is not stable and not guaranteed to be reproduced on other platforms by other users.
The runtime relies on the warp-synchronous model and threads in the warp after the critical region must be synchronized. It means, that we should not use barriers here but still need to synchronize threads within the warp. To synchronize the threads we must use __syncwarp(mask) function instead.
Currently, it is pure coincidence that the test passes. It happens just because the second parallel region requires much time to start and execute its body and the threads in the else branch have time to execute their code. But it is not guaranteed in Cuda9+.
To reveal this problem, just enclose the code in else branch (Count += omp_get_level(); // 6 * 1 = 6) under control of another #pragma omp critical.

#pragma omp critical
Count += omp_get_level(); // 6 * 1 = 6

It must produce the same result as before but it won't, most probably.

I still get the correct results. Do you have a test that you know to fail?

I get Expected count = 67 with the critical section in this test. It is on Power9 with Cuda9. Did you try to compile it at O3?

At first I didn't, but now the original test case with added critical in the else branch works with full optimization iff I completely remove the specialization CGOpenMPRuntimeNVPTX::emitCriticalRegion.

But again, it just masks the real problem but does not solve it. It is again just a pure coincidence that it returns the expected result.

for (int I = 0; I < 32; ++I) {
 if (omp_get_thread_num() == I) {
   #pragma omp critical
   Count += omp_get_level(); // 6 * 1 = 6
  }
}

Again, it will fail though it must return correct result.

But simple removal does not help, actually. It still might produce incorrect results. When you removed the barrier, you introduced implicit threads divergence. Since cuda9+ threads are not executed in lock-step manner anymore (see https://devblogs.nvidia.com/using-cuda-warp-level-primitives/). It leads to the fact that the result you get is not stable and not guaranteed to be reproduced on other platforms by other users.
The runtime relies on the warp-synchronous model and threads in the warp after the critical region must be synchronized. It means, that we should not use barriers here but still need to synchronize threads within the warp. To synchronize the threads we must use __syncwarp(mask) function instead.
Currently, it is pure coincidence that the test passes. It happens just because the second parallel region requires much time to start and execute its body and the threads in the else branch have time to execute their code. But it is not guaranteed in Cuda9+.
To reveal this problem, just enclose the code in else branch (Count += omp_get_level(); // 6 * 1 = 6) under control of another #pragma omp critical.

#pragma omp critical
Count += omp_get_level(); // 6 * 1 = 6

It must produce the same result as before but it won't, most probably.

I still get the correct results. Do you have a test that you know to fail?

I get Expected count = 67 with the critical section in this test. It is on Power9 with Cuda9. Did you try to compile it at O3?

At first I didn't, but now the original test case with added critical in the else branch works with full optimization iff I completely remove the specialization CGOpenMPRuntimeNVPTX::emitCriticalRegion.

But again, it just masks the real problem but does not solve it. It is again just a pure coincidence that it returns the expected result.

for (int I = 0; I < 32; ++I) {
 if (omp_get_thread_num() == I) {
   #pragma omp critical
   Count += omp_get_level(); // 6 * 1 = 6
  }
}

Again, it will fail though it must return correct result.

No, this loop also works with full optimization if surrounded by target parallel for map(tofrom: Count) and correctly returns the value 32. If you had different directives in mind, I'd kindly ask you to post a full example that you tested on your end.

I assume you are using the latest CUDA 9.2 with all fixes to the nasty synchronization issues that prior versions of CUDA 9 had?

But simple removal does not help, actually. It still might produce incorrect results. When you removed the barrier, you introduced implicit threads divergence. Since cuda9+ threads are not executed in lock-step manner anymore (see https://devblogs.nvidia.com/using-cuda-warp-level-primitives/). It leads to the fact that the result you get is not stable and not guaranteed to be reproduced on other platforms by other users.
The runtime relies on the warp-synchronous model and threads in the warp after the critical region must be synchronized. It means, that we should not use barriers here but still need to synchronize threads within the warp. To synchronize the threads we must use __syncwarp(mask) function instead.
Currently, it is pure coincidence that the test passes. It happens just because the second parallel region requires much time to start and execute its body and the threads in the else branch have time to execute their code. But it is not guaranteed in Cuda9+.
To reveal this problem, just enclose the code in else branch (Count += omp_get_level(); // 6 * 1 = 6) under control of another #pragma omp critical.

#pragma omp critical
Count += omp_get_level(); // 6 * 1 = 6

It must produce the same result as before but it won't, most probably.

I still get the correct results. Do you have a test that you know to fail?

I get Expected count = 67 with the critical section in this test. It is on Power9 with Cuda9. Did you try to compile it at O3?

At first I didn't, but now the original test case with added critical in the else branch works with full optimization iff I completely remove the specialization CGOpenMPRuntimeNVPTX::emitCriticalRegion.

But again, it just masks the real problem but does not solve it. It is again just a pure coincidence that it returns the expected result.

for (int I = 0; I < 32; ++I) {
 if (omp_get_thread_num() == I) {
   #pragma omp critical
   Count += omp_get_level(); // 6 * 1 = 6
  }
}

Again, it will fail though it must return correct result.

No, this loop also works with full optimization if surrounded by target parallel for map(tofrom: Count) and correctly returns the value 32. If you had different directives in mind, I'd kindly ask you to post a full example that you tested on your end.

I assume you are using the latest CUDA 9.2 with all fixes to the nasty synchronization issues that prior versions of CUDA 9 had?

The same test with the code I sent at O3 with original critical region returns 67. Cuda V9.2.64

But simple removal does not help, actually. It still might produce incorrect results. When you removed the barrier, you introduced implicit threads divergence. Since cuda9+ threads are not executed in lock-step manner anymore (see https://devblogs.nvidia.com/using-cuda-warp-level-primitives/). It leads to the fact that the result you get is not stable and not guaranteed to be reproduced on other platforms by other users.
The runtime relies on the warp-synchronous model and threads in the warp after the critical region must be synchronized. It means, that we should not use barriers here but still need to synchronize threads within the warp. To synchronize the threads we must use __syncwarp(mask) function instead.
Currently, it is pure coincidence that the test passes. It happens just because the second parallel region requires much time to start and execute its body and the threads in the else branch have time to execute their code. But it is not guaranteed in Cuda9+.
To reveal this problem, just enclose the code in else branch (Count += omp_get_level(); // 6 * 1 = 6) under control of another #pragma omp critical.

#pragma omp critical
Count += omp_get_level(); // 6 * 1 = 6

It must produce the same result as before but it won't, most probably.

I still get the correct results. Do you have a test that you know to fail?

I get Expected count = 67 with the critical section in this test. It is on Power9 with Cuda9. Did you try to compile it at O3?

At first I didn't, but now the original test case with added critical in the else branch works with full optimization iff I completely remove the specialization CGOpenMPRuntimeNVPTX::emitCriticalRegion.

But again, it just masks the real problem but does not solve it. It is again just a pure coincidence that it returns the expected result.

for (int I = 0; I < 32; ++I) {
 if (omp_get_thread_num() == I) {
   #pragma omp critical
   Count += omp_get_level(); // 6 * 1 = 6
  }
}

Again, it will fail though it must return correct result.

No, this loop also works with full optimization if surrounded by target parallel for map(tofrom: Count) and correctly returns the value 32. If you had different directives in mind, I'd kindly ask you to post a full example that you tested on your end.

I assume you are using the latest CUDA 9.2 with all fixes to the nasty synchronization issues that prior versions of CUDA 9 had?

The same test with the code I sent at O3 with original critical region returns 67. Cuda V9.2.64

Did not modified the test. Run one more time with original critical region only and a loop with critical region inside: Expected count = 50

ABataev abandoned this revision.Thu, Jul 25, 7:17 AM

Found another solution to fix this test without significant changes in the test itself and to make it work on Cuda8+. Need a little bit reworked D65013 + another one patch with a new runtime function based on D65013.