
[OPENMP][NVPTX] Fix incompatibility of __syncthreads with LLVM, NFC.
ClosedPublic

Authored by ABataev on Jan 3 2019, 8:22 AM.

Details

Summary

One of the LLVM optimizations, critical edge splitting, also clones tail
instructions. This is a dangerous transformation for the __syncthreads()
function, and it leads to undefined behavior or incorrect results. This
patch fixes the problem by replacing the __syncthreads() call with the
equivalent assembler instruction, whose cost is modeled as too high for
it to be copied.
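
For reference, the change amounts to emitting the barrier as inline assembly instead of the intrinsic. A sketch at the IR level (illustrative, not the literal diff):

```llvm
; Before: the intrinsic call, which the cost model considers cheap to clone.
call void @llvm.nvvm.barrier0()

; After (sketch): the same barrier expressed as inline asm; inline asm is
; costed conservatively, so passes will not duplicate it.
call void asm sideeffect "bar.sync 0;", ""()
```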

Diff Detail

Repository
rL LLVM

Event Timeline

ABataev created this revision.Jan 3 2019, 8:22 AM
grokos accepted this revision.Jan 3 2019, 8:37 AM

I'll accept the patch for the sake of consistency and correctness of execution. Just one question:

which cost is too high

So should we expect a performance penalty until function copy is fixed in LLVM and we can revert back to __syncthreads()?

This revision is now accepted and ready to land.Jan 3 2019, 8:37 AM

No, the cost in LLVM terms is high, but in the end, we end up with absolutely the same code as before. That's why it is marked as NFC.

This revision was automatically updated to reflect the committed changes.

Do I understand correctly that the intrinsic is apparently "misoptimized" by LLVM?
If so, would that also be a problem for other CUDA code, for example user code that uses this intrinsic?

Herald added a project: Restricted Project. May 24 2019, 6:17 PM
arsenm added a subscriber: arsenm.May 24 2019, 6:26 PM

Is OpenMP not marking all functions as convergent?

ping

It marks them, but some of the optimizations ignore this attribute. I don't remember which one exactly, something like critical edge splitting.

I think critical edge splitting handles convergent correctly, since it is one of the motivating examples. I just looked at a random example in test/OpenMP, and this doesn't look correct to me:

__kmpc_barrier is declared as convergent, but the callers are not:

declare void @__kmpc_barrier(%struct.ident_t*, i32) #3
define internal void @__omp_outlined__78(i32* noalias %.global_tid., i32* noalias %.bound_tid.) #0 {
attributes #0 = { noinline norecurse nounwind optnone "correctly-rounded-divide-sqrt-fp-math"="false" "disable-tail-calls"="false" "less-precise-fpmad"="false" "min-legal-vector-width"="0" "no-frame-pointer-elim"="false" "no-infs-fp-math"="false" "no-jump-tables"="false" "no-nans-fp-math"="false" "no-signed-zeros-fp-math"="false" "no-trapping-math"="false" "stack-protector-buffer-size"="8" "target-features"="+ptx32,+sm_20" "unsafe-fp-math"="false" "use-soft-float"="false" }
attributes #1 = { nounwind readnone }
attributes #2 = { argmemonly nounwind }

*All* functions need to be assumed convergent, not just the convergent barrier leaves.
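
Concretely, the fix being described would add convergent to the caller's attribute group as well. An illustrative sketch (attribute list abbreviated, body elided):

```llvm
; Sketch: the outlined caller itself carries convergent, so transforms on
; its body must respect the convergence rules (other attributes elided).
define internal void @__omp_outlined__78(i32* noalias %.global_tid.,
                                         i32* noalias %.bound_tid.) #0 {
  ...
}
attributes #0 = { convergent noinline norecurse nounwind optnone }
```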

The problem is not in the OpenMP code, it is in the CUDA code. It appears only when we inline the runtime written in CUDA, where everything is marked correctly. For OpenMP code it is not necessary to mark all the functions as convergent; all the required functions are already marked on the CUDA side.

I don't follow how this is unnecessary. This is producing an IR module with a convergent call from a non-convergent function. This is plainly broken, and the verifier should probably reject it. Any transform on the caller of these could violate the convergent rules. The IR should be semantically correct at all times, regardless of what is inlined or linked.

+1 to the verifier check. @jlebar, do you agree?

If the verifier is broken, it must be fixed, of course, and __kmpc_barrier too. But the problem still remains: at least one of the functions that calculates the cost in edge splitting does not take the convergent attribute into account, and this leads to dangerous optimizations.

Is there a public test case? If not, can you share/construct one?

Better to ask Doru; he tried to investigate this problem (after my patch, which does just the same for the named barriers, the asm volatile construct does not have this problem) and, if I recall correctly, reported it. But I'm not sure to whom he reported it, LLVM or NVIDIA.

I reported several problems to NVIDIA. Is the problem below the one you're referring to?

For the following CUDA code:

if (threadIdx.x == 0) {
  // do some initialization (A)
}
__syncthreads();
// some code (B)

When I enable optimizations, the __syncthreads call gets duplicated and the code hangs at runtime:

entry:
  %0 = tail call i32 @llvm.nvvm.read.ptx.sreg.tid.x() #6, !range !12
  %cmp.i2 = icmp eq i32 %0, 0
  br i1 %cmp.i2, label %if.then, label %if.end.split

if.end.split:
  tail call void @llvm.nvvm.barrier0() #6
  ; LLVM IR for B code block
  br label %if.end

if.then:
  ; LLVM IR for A code block
  tail call void @llvm.nvvm.barrier0() #6
  ; LLVM IR for B code block
  br label %if.end

if.end:
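
As a host-side analogy (Python's threading.Barrier, not NVPTX semantics; on the GPU the mismatch manifests as a hang rather than an exception), cloning the barrier call onto one path breaks the rendezvous because the two paths no longer reach the barrier the same number of times:

```python
import threading

def run(n_threads, duplicated):
    # One barrier that every thread is supposed to reach exactly once,
    # mirroring the single llvm.nvvm.barrier0() in the unoptimized IR.
    barrier = threading.Barrier(n_threads)
    broken = []

    def worker(tid):
        try:
            if duplicated and tid == 0:
                # The "cloned" barrier on the if.then path: thread 0 now
                # waits twice while every other thread waits once.
                barrier.wait(timeout=2)
            barrier.wait(timeout=2)
        except threading.BrokenBarrierError:
            broken.append(tid)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return not broken  # True iff every rendezvous completed cleanly

print(run(4, duplicated=False))  # prints True: matched barrier counts
print(run(4, duplicated=True))   # prints False: the barrier breaks
```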

Can you post the starting IR for this?

Yes, that's the problem.

Without optimizations enabled, I get the following code, which works correctly and doesn't hang:

entry:
  %0 = tail call i32 @llvm.nvvm.read.ptx.sreg.tid.x() #6, !range !12
  %cmp.i2 = icmp eq i32 %0, 0
  br i1 %cmp.i2, label %if.then, label %if.end

if.then:
  ; LLVM IR for A code block
  br label %if.end

if.end:
  tail call void @llvm.nvvm.barrier0() #6
  ; LLVM IR for B code block

The optimization that is being applied is called "call site splitting" in LLVM.

Can you post the complete IR which reproduces this?

I reported several problems to NVIDIA.

Unfortunately, reporting problems to NVIDIA doesn't necessarily cause things to be fixed upstream. If you can (also) report problems occurring upstream to the upstream bug tracker, that will be greatly helpful.

The scheme I'd like to see us follow upstream is, when workarounds are required, to leave the proper code in place in addition to the workarounds. Guard the workarounds with ifdefs so that we can use them for older versions as necessary, but use the proper version when compiling with newer compilers (once the bugs have been fixed). Over time, we'll be able to eliminate workarounds for compilers we no longer support.

Thanks for this! Should we now revert this patch to use __syncthreads() once again instead of the asm instruction?

Yes

The asm also isn't necessarily a real workaround. Inline asm call sites need to be marked as convergent as well

I agree. Plus, we have the same problem with the named barriers. They are also represented as inline asm and must also be marked as convergent.