
[OPENMP][NVPTX] Fix incompatibility of __syncthreads with LLVM, NFC.
ClosedPublic

Authored by ABataev on Jan 3 2019, 8:22 AM.

Details

Summary

One of the LLVM optimizations, critical edge splitting, also clones tail
instructions. This is a dangerous transformation for the __syncthreads()
function, and it leads to undefined behavior or incorrect results. This
patch fixes the problem by replacing the __syncthreads() call with the
equivalent assembler instruction, whose cost is modeled as too high for
it to be copied.
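
For reference, the change amounts to emitting the barrier as inline assembly instead of the intrinsic. A sketch at the IR level (illustrative, not the literal diff):

```llvm
; Before: the intrinsic call, which the cost model considers cheap to clone.
call void @llvm.nvvm.barrier0()

; After (sketch): the same barrier expressed as inline asm; inline asm is
; costed conservatively, so passes will not duplicate it.
call void asm sideeffect "bar.sync 0;", ""()
```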

Diff Detail

Repository
rL LLVM

Event Timeline

ABataev created this revision.Jan 3 2019, 8:22 AM
grokos accepted this revision.Jan 3 2019, 8:37 AM

I'll accept the patch for the sake of consistency and correctness of execution. Just one question:

which cost is too high

So should we expect a performance penalty until function copy is fixed in LLVM and we can revert back to __syncthreads()?

This revision is now accepted and ready to land.Jan 3 2019, 8:37 AM

No, the cost in LLVM terms is high, but in the end, we end up with absolutely the same code as before. That's why it is marked as NFC.

This revision was automatically updated to reflect the committed changes.

Do I understand correctly that the intrinsic is apparently "misoptimized" by LLVM?
If so, would that also be a problem for other CUDA code, for example user code that uses this intrinsic?

Herald added a project: Restricted Project. May 24 2019, 6:17 PM
arsenm added a subscriber: arsenm.May 24 2019, 6:26 PM

Is OpenMP not marking all functions as convergent?

ping

It marks them, but some of the optimizations ignore this attribute. I don't remember which one exactly, something like critical edge splitting.

I think critical edge splitting handles convergent correctly, since it is one of the motivating examples. I just looked at a random example in test/OpenMP, and this doesn't look correct to me:

__kmpc_barrier is declared as convergent, but the callers are not:

declare void @__kmpc_barrier(%struct.ident_t*, i32) #3
define internal void @__omp_outlined__78(i32* noalias %.global_tid., i32* noalias %.bound_tid.) #0 {
attributes #0 = { noinline norecurse nounwind optnone "correctly-rounded-divide-sqrt-fp-math"="false" "disable-tail-calls"="false" "less-precise-fpmad"="false" "min-legal-vector-width"="0" "no-frame-pointer-elim"="false" "no-infs-fp-math"="false" "no-jump-tables"="false" "no-nans-fp-math"="false" "no-signed-zeros-fp-math"="false" "no-trapping-math"="false" "stack-protector-buffer-size"="8" "target-features"="+ptx32,+sm_20" "unsafe-fp-math"="false" "use-soft-float"="false" }
attributes #1 = { nounwind readnone }
attributes #2 = { argmemonly nounwind }

*All* functions need to be assumed convergent, not just the convergent barrier leaves.
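
Concretely, the fix being described would add convergent to the caller's attribute group as well. An illustrative sketch (attribute list abbreviated, body elided):

```llvm
; Sketch: the outlined caller itself carries convergent, so transforms on
; its body must respect the convergence rules (other attributes elided).
define internal void @__omp_outlined__78(i32* noalias %.global_tid.,
                                         i32* noalias %.bound_tid.) #0 {
  ...
}
attributes #0 = { convergent noinline norecurse nounwind optnone }
```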

The problem is not in the OpenMP code, it is in the CUDA code. It appears only when we inline the runtime written in CUDA, where everything is marked correctly. For OpenMP code it is not necessary to mark all the functions as convergent; all the required functions are already marked on the CUDA side.

I don't follow how this is unnecessary. This is producing an IR module with a convergent call from a non-convergent function. This is plainly broken, and the verifier should probably reject it. Any transform on the caller of these could violate the convergent rules. The IR should be semantically correct at all times, regardless of what is inlined or linked.

+1 to the verifier check. @jlebar, do you agree?

If the verifier is broken, it must be fixed, of course, and __kmpc_barrier too. But the problem still remains: at least one of the functions that calculates the cost in edge splitting does not take the convergent attribute into account, and this leads to dangerous optimizations.

Is there a public test case? If not, can you share/construct one?

Better to ask Doru; he tried to investigate this problem (after my patch, which does just the same for the named barriers, the asm volatile construct does not have this problem) and, if I recall correctly, reported it. But I'm not sure to whom he reported it, LLVM or NVIDIA.

I reported several problems to NVIDIA. Is the problem below the one you're referring to?

For the following CUDA code:

if (threadIdx.x == 0) {
  // do some initialization (A)
}
__syncthreads();
// some code (B)

When I enable optimizations, the __syncthreads call gets duplicated and the code hangs at runtime:

entry:
  %0 = tail call i32 @llvm.nvvm.read.ptx.sreg.tid.x() #6, !range !12
  %cmp.i2 = icmp eq i32 %0, 0
  br i1 %cmp.i2, label %if.then, label %if.end.split

if.end.split:
  tail call void @llvm.nvvm.barrier0() #6
  ; LLVM IR for B code block
  br label %if.end

if.then:
  ; LLVM IR for A code block
  tail call void @llvm.nvvm.barrier0() #6
  ; LLVM IR for B code block
  br label %if.end

if.end:
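
As a host-side analogy (Python's threading.Barrier, not NVPTX semantics; on the GPU the mismatch manifests as a hang rather than an exception), cloning the barrier call onto one path breaks the rendezvous because the two paths no longer reach the barrier the same number of times:

```python
import threading

def run(n_threads, duplicated):
    # One barrier that every thread is supposed to reach exactly once,
    # mirroring the single llvm.nvvm.barrier0() in the unoptimized IR.
    barrier = threading.Barrier(n_threads)
    broken = []

    def worker(tid):
        try:
            if duplicated and tid == 0:
                # The "cloned" barrier on the if.then path: thread 0 now
                # waits twice while every other thread waits once.
                barrier.wait(timeout=2)
            barrier.wait(timeout=2)
        except threading.BrokenBarrierError:
            broken.append(tid)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return not broken  # True iff every rendezvous completed cleanly

print(run(4, duplicated=False))  # prints True: matched barrier counts
print(run(4, duplicated=True))   # prints False: the barrier breaks
```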

Can you post the starting IR for this?

Yes, that's the problem.

Without optimizations enabled, I get the following code, which works correctly and doesn't hang:

entry:
  %0 = tail call i32 @llvm.nvvm.read.ptx.sreg.tid.x() #6, !range !12
  %cmp.i2 = icmp eq i32 %0, 0
  br i1 %cmp.i2, label %if.then, label %if.end

if.then:
  ; LLVM IR for A code block
  br label %if.end

if.end:
  tail call void @llvm.nvvm.barrier0() #6
  ; LLVM IR for B code block

The optimization that is being applied is called "call site splitting" in LLVM.

Can you post the complete IR which reproduces this?

I reported several problems to NVIDIA.

Unfortunately, reporting problems to NVIDIA doesn't necessarily cause things to be fixed upstream. If you can (also) report problems occurring upstream to the upstream bug tracker, that will be greatly helpful.

The scheme I'd like to see us follow upstream is, when workarounds are required, to leave the proper code in place in addition to the workarounds. Guard the workarounds with ifdefs so that we can use them for older versions as necessary, but use the proper version when compiling with newer compilers (once the bugs have been fixed). Over time, we'll be able to eliminate workarounds for compilers we no longer support.

Thanks for this! Should we now revert this patch to use __syncthreads() once again instead of the asm instruction?

Yes

The asm also isn't necessarily a real workaround. Inline asm call sites need to be marked as convergent as well

I agree. Plus, we have the same problem with the named barriers. They are also represented as inline asm and must also be marked as convergent.